Training and evaluating spaCy model by sentences or paragraphs - nlp

Observation:
Paragraph: I love apple. I eat one banana a day
Sentences: I love apple. / I eat one banana a day
There are two sentences in this paragraph: "I love apple" and "I eat one banana a day". If I put the whole paragraph into spaCy, it will recognize only one entity, for example apple, but if I put the sentences of the paragraph in one by one, spaCy can recognize two entities, apple and banana. (This is just an example to illustrate my point; the actual recognition result could be different.)
Situation:
After training a model myself, I want to evaluate its recognition accuracy. There are two ways to pass the text into the spaCy model:
1. Split the paragraph into sentences and pass them one by one:
for sentence in sentences:  # sentences obtained by splitting the paragraph
    doc = nlp(sentence)
    # retrieve the parsing result
2. Pass the whole paragraph at once:
doc = nlp(paragraph)
# retrieve the parsing result
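For example, using spaCy's own sentence segmentation to do the splitting (en_core_web_sm here is just a placeholder for my trained model), the two options would look roughly like this:
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder for my own trained model
paragraph = "I love apple. I eat one banana a day."

# Option 1: segment into sentences first, then run each sentence separately
sentence_entities = []
for sent in nlp(paragraph).sents:
    doc = nlp(sent.text)
    sentence_entities.extend([(ent.text, ent.label_) for ent in doc.ents])

# Option 2: run the whole paragraph at once
paragraph_entities = [(ent.text, ent.label_) for ent in nlp(paragraph).ents]

print(sentence_entities)
print(paragraph_entities)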
Question:
I'm wondering which way is better for testing the performance of the model, since I'm fairly sure that passing the text sentence by sentence will always recognize more entities than passing it paragraph by paragraph.
If the second way is better, do I also need to change the way I trained the model? Currently, I train the spaCy model sentence by sentence rather than paragraph by paragraph.
The goal of my project:
After getting a document, recognize all the entities in the document that I'm interested in.
Thanks!

Related

Find whether a sentence is related to a medical term or not

Input: the user enters a sentence
If the sentence is related to any medical term, or if the user needs any medical attention:
    Output = True
else:
    Output = False
I am reading https://www.nltk.org/. I scraped https://www.merriam-webster.com/browse/medical/a to get the medical-related words, but I am unable to figure out how to detect which sentences are related to a medical term. I haven't written any code because the algorithm is not clear to me.
I want to know what I should use and where to start; I need a tutorial link to implement this. Any guidance will be highly appreciated.
I will list the various ways you can do this, from naive to intelligent:
Get a large vocabulary of medical terms, iterate over the sentence, and return yes or no depending on whether you find any of them.
Get a large vocabulary of medical terms, iterate over the sentence, and do a fuzzy match with each word, so that words that are syntactic (alphabetical) variations of the same word are still detected and caught. [Check the fuzzywuzzy library in Python; a sketch follows after this list.]
Get a large vocabulary of medical terms with a definition for each. Use pre-trained word embeddings (word2vec, GloVe, etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence as well. Then take the cosine similarity between them to calculate how contextually similar the text is to the description of the medical term. If the similarity is above a certain threshold that you fix, return True. [This approach doesn't need the exact term; even if the person is only talking about the condition, it should be able to detect it.]
Label a large number of sentences with the respective medical terms in them (annotate using something like the API.AI entity annotation tool or the RASA entity annotation tool). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers, and an output over the list of medical terms/conditions with a softmax. This will give you the probability of each condition or term being associated with the sentence.
Create a neural network with an encoder-decoder architecture and an attention layer between them. Create encoder embeddings from the input sentence. Create a decoder whose output is a string of medical terms. Train the encoder-decoder with the attention layer on pre-annotated data.
Create a pointer network which takes as input a sentence with the respective medical terms and returns pointers that point back to the inputs and mark them as medical term or non-medical term. (Not easy to build, FYI.)
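To make the fuzzy-matching option concrete, a minimal sketch with fuzzywuzzy could look like this (the vocabulary and the threshold of 85 are just placeholders you would tune):
from fuzzywuzzy import fuzz

medical_terms = ["abdomen", "fracture", "hypertension"]  # placeholder vocabulary

def is_medical_fuzzy(sentence, threshold=85):
    # flag the sentence if any word is close enough to a known medical term
    for word in sentence.lower().split():
        for term in medical_terms:
            if fuzz.ratio(word, term) >= threshold:
                return True
    return False

print(is_medical_fuzzy("My abdomon hurts"))  # True, despite the misspelling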
OK, so I don't understand which part you do not understand, because the idea is rather simple and one Google search gives you great and easy results. Unless the issue is that you don't know Python; in that case it will be very hard for you to implement this.
The idea itself is simple: tokenize the sentence (have each word by itself in a list) and search the list of medical terms. If the current word is in that list, the term is medical, so the sentence is related to that medical term as well. If you imagine that you have your medical terms in a medical_terms list, then in Python it would look something like this:
>>> import nltk
>>> medical_terms = ["abdomen", "fracture", "hypertension"]  # imagine a much larger list here
>>> sentence = """At eight o'clock on Thursday morning
... Arthurs' abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthurs', "'", 'abdomen', 'was', 'hurting', '.']
>>> def is_medical(tokens):
...     for token in tokens:
...         if token in medical_terms:
...             return True
...     return False
>>> is_medical(tokens)
True
You just tokenize the input sentence with NLTK and then check whether any of the words in the sentence are in the list of medical terms. You can adapt this function to work with n-grams as well, as sketched below. There are many other approaches and special cases that would have to be handled, but this is a good start.
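For example, to also catch multi-word terms like "heart attack", one possible n-gram adaptation (the terms here are just placeholders) would be:
import nltk

multiword_terms = {("heart", "attack"), ("blood", "pressure")}  # placeholder multi-word terms

def is_medical_ngrams(tokens, n=2):
    # check every n-gram of the tokenized sentence against the multi-word terms
    for gram in nltk.ngrams([t.lower() for t in tokens], n):
        if gram in multiword_terms:
            return True
    return False

print(is_medical_ngrams(nltk.word_tokenize("He had a heart attack last year")))  # True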

How to add a tokenizer exception for whitespace in spaCy language models

The following is my code, where I take user input.
import en_core_web_sm
nlp = en_core_web_sm.load()
text = input("please enter your text or words here")
doc = nlp(text)
print([t.text for t in doc])
If the user inputs the text Deep Learning, the text is broken into
['Deep', 'Learning']
How do I add a whitespace exception in nlp, so that the output is like the one below?
['Deep Learning']
The raw text from the user input is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
So if your user types in: Looking for Deep Learning experts
it will be tokenized as: ['Looking', 'for', 'Deep', 'Learning', 'experts']
spaCy does not know that Deep Learning is an entity on its own. If you want spaCy to recognize Deep Learning as a single entity, you need to teach it. If you have a predefined list of phrases that you want spaCy to recognize as single entities, you can use the PhraseMatcher to do that.
You can check the details on how to use PhraseMatcher here
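A minimal sketch of what that could look like (the term list is just an example, and the matcher.add call below uses the spaCy v3 signature; v2 takes an extra on_match argument before the patterns):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

terms = ["Deep Learning", "Machine Learning"]  # your predefined phrases
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("TECH_TERMS", patterns)  # v2: matcher.add("TECH_TERMS", None, *patterns)

doc = nlp("Looking for Deep Learning experts")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # Deep Learning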
UPDATE - Reply to OP's comment below
I do not think there is a way spaCy can know about the entity you are looking for without being trained in the context of your domain or being provided a predefined subset of the entities.
The only solution I can think of is to use an annotation tool to teach spaCy:
- Take a subset of your user inputs and annotate them manually (you can use the Prodigy tool by the makers of spaCy, or brat, which is free)
- Use the annotations to train a new or existing NER model. Details on training a model can be found here.
Given a text like "Looking for Deep Learning experts", you would annotate "Deep Learning" with a label such as "FIELD". Then train a new entity type, 'FIELD'.
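For illustration, the training annotations could look something like this (the offsets are the character span of the phrase, and the label name FIELD is just an example):
TRAIN_DATA = [
    ("Looking for Deep Learning experts", {"entities": [(12, 25, "FIELD")]}),
    ("We need a Machine Learning engineer", {"entities": [(10, 26, "FIELD")]}),
]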
Once you have trained the model in the context, spaCy will learn to detect entities of interest.

To check if a string of words is a sentence

I have a text file from which I have to eliminate all the statements that do not make any sense; in other words, I have to check whether each statement is a proper sentence or not.
For example:
1. John is a heart patient.
2. Dr. Green, Rob is the referring doctor for the patient.
3. Jacob Thomas, M.D. is the ordering provider
4. Xray Shoulder PA, Oblique, TRUE Lateral, 18° FOSSA LAT LT; Status: Complete;
Sentences 1, 2, and 3 make sense, but sentence 4 does not, so I want to eliminate it.
May I know how it could be done?
This task seems very difficult; however, assuming you have the training data, you could likely use XGBoost, which uses boosted decision trees (and random forests). You would train it to answer positive or negative (yes, it makes sense, or no).
You would then need to come up with features. You could use the features from the NLTK part of speech (POS) tags. The number of occurrences of each of the types of tags in the sentence would be a good first model. That can set your benchmark for how good an "easy" solution is.
You may also be able to look into the utility of a (word/sentence)-to-vector model, such as those in gensim, for creating features for your model.
First I would see what happens with just the number of occurrences of each POS tag fed into XGBoost. Train and test a model and see how well it does. Then look at adding other features, such as position, or using doc2vec as your input to XGBoost.
Last resort would be a neural network (which would only be recommended if the prior ideas fail, and you have lots and lots of data). If you did use a neural net I would think an LSTM would likely be useful.
You would have to experiment and the amount of data matters, but you can start simple and then test and add to your model iteratively.
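A rough sketch of that first baseline, assuming you have labelled examples (1 = makes sense, 0 = does not); the tag list and the tiny dataset here are only illustrative:
from collections import Counter

import nltk
from xgboost import XGBClassifier

TAGS = ["NN", "NNS", "NNP", "VB", "VBD", "VBZ", "DT", "IN", "JJ", "CD", "."]

def pos_counts(text):
    # count how often each POS tag appears in the text
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    counts = Counter(tags)
    return [counts.get(tag, 0) for tag in TAGS]

examples = [
    ("John is a heart patient.", 1),
    ("Jacob Thomas, M.D. is the ordering provider", 1),
    ("Xray Shoulder PA, Oblique, TRUE Lateral, 18 FOSSA LAT LT; Status: Complete;", 0),
]
X = [pos_counts(text) for text, _ in examples]
y = [label for _, label in examples]

model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict([pos_counts("Dr. Green, Rob is the referring doctor for the patient.")]))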
It's very hard to be 100% confident, but let's try.
You can use Amazon Comprehend (Natural Language Processing and Text Analytics) and create your own metrics over the sentences. For example:
John is a heart patient.
Amazon will give you: "." Punctuation, "a" Determiner, "heart" Noun, "is" Verb, "John" Proper Noun, "patient" Noun.
That is 1 Punctuation, 1 Determiner, 2 Nouns, 1 Verb, 1 Proper Noun. You will probably need a noun and a verb to have a valid sentence.
In your last sentence we have:
3 Punctuation, 1 Numeral, 11 Proper Nouns. There is no action (verb), so this sentence probably isn't valid. A rough local version of this check is sketched below.
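For example, the same "has a verb and a noun" heuristic can be tried locally with NLTK's tagger instead of Amazon Comprehend (the required tags are only a starting point):
import nltk

def looks_like_sentence(text):
    # tag the words and require at least one verb and one noun
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    has_verb = any(tag.startswith("VB") for tag in tags)
    has_noun = any(tag.startswith("NN") for tag in tags)
    return has_verb and has_noun

print(looks_like_sentence("John is a heart patient."))  # True
print(looks_like_sentence("Xray Shoulder PA, Oblique, TRUE Lateral, 18 FOSSA LAT LT; Status: Complete;"))  # likely False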

nlp- Difference between Sentences and a Document in Stanford OpenNLP?

Let us say we have an article that we want to annotate. If we input the text as one really long Sentence, as opposed to a Document, does Stanford do anything differently between annotating that one long Sentence and looping through every Sentence in the Document and combining all of the results together?
EDIT: I ran a test and it seems like the two approaches return two different NER sets. I might be just doing it wrong, but it's certainly super interesting and I'm curious as to why this happens.
To confirm: you mean Stanford CoreNLP (as opposed to Apache OpenNLP), right?
The main difference in the CoreNLP Simple API between a Sentence and a Document is tokenization. A Sentence will force the entire text to be considered as a single sentence, even if it has punctuation. A Document will first tokenize the text into a list of sentences, and then annotate each sentence.
Note that for annotators like the constituency parser, very long sentences will take prohibitively long to annotate. Also, note that coreference only works on documents, not sentences.

Detecting content based on position in sentence with OpenNLP

I've successfully used OpenNLP for document categorization and was also able to extract names from trained samples and using regular expressions.
I was wondering if it is also possible to extract names (or, more generally speaking, subjects) based on their position in a sentence?
E.g. instead of training with concrete names that are known a priori, like Travel to <START:location> New York <END>, I would prefer not to provide concrete examples but let OpenNLP decide that anything appearing at the specified position could be an entity. That way, I wouldn't have to provide each and every possible option (which is impossible in my case anyway) but only provide an example of the possible surrounding sentence.
That is context-based learning, and OpenNLP already does that. You have to train it with proper and more numerous examples to get good results.
For example, when there is Professor X in a sentence, the trained OpenNLP model.bin gives you X as a name, whereas when X is present in a sentence without Professor in front of it, it might not give you X as a name. An illustrative pair of training lines is shown below.
According to its documentation, give it 15,000 sentences of training data and you can expect good results.
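For example, training lines in OpenNLP's name finder format that teach the surrounding context rather than a specific name might look like this (the names and sentences are invented for illustration):
Professor <START:person> Smith <END> will give the lecture on Tuesday .
The seminar given by Professor <START:person> Jones <END> was cancelled .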
