I am having trouble detecting named entities that start with a lowercase letter. If I train the model on lowercase words only, the accuracy is reasonable; however, when the model is trained on fully uppercase tokens, or even a mix of lowercase and uppercase, the results are very bad. I tried some of the features offered by the Stanford NLP Group's NERFeatureFactory class, as well as a variety of sentences, but I could not get the results I expected.
An example of the problem I am facing is as follows:
"ali studied at university of michigan and now he works for us navy."
I expected the model to recognize the entities as follows:
"university" : "FACILITY",
"of michigan" : "FACILITY",
"ali" : "PERSON"
"us" : "ORGANIZATION"
"navy" : "ORGANIZATION"
If the .TSV file used as training data contains ONLY lowercase letters, I get the above result; otherwise the result is surprising.
Any help is highly appreciated in advance.
If you have lowercase or mixed-case text, accuracy can suffer because the Stanford NLP models are trained on conventionally edited, properly capitalized text. There are a couple of useful ways to approach this problem:
One way is to correctly capitalize the text with a true case annotator, and then process the resulting text with the regular NER model.
Another way is to explore caseless models including ones that are available as part of Stanford NER.
You can read more here.
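To make the first suggestion more concrete, here is a toy sketch of the "restore the capitalization, then run the regular NER model" idea. It is not the Stanford truecase annotator: it simply learns the most frequent casing of each word from a small cased reference text. The function names `build_case_table` and `truecase` and the reference sentence are assumptions made up for this illustration.

```python
import re
from collections import Counter, defaultdict

def build_case_table(cased_text):
    """Map each lowercased word to its most frequent casing in the reference text."""
    counts = defaultdict(Counter)
    for word in re.findall(r"\w+", cased_text):
        counts[word.lower()][word] += 1
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

def truecase(text, table):
    """Replace every known word with its learned casing; leave unknown words alone."""
    return re.sub(r"\w+", lambda m: table.get(m.group().lower(), m.group()), text)

# Hypothetical cased reference text; a real table needs a large, well-edited corpus.
reference = "Ali met Maria at the University of Michigan before he joined the US Navy"
table = build_case_table(reference)

restored = truecase("ali studied at university of michigan and now he works for us navy.", table)
print(restored)
```

A lookup like this cannot tell the pronoun "us" from the country "US", which is one reason a trained truecaser or a caseless NER model is the more robust choice.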
Input: user enters a sentence
if the word is related to any medical term , or if he needs any medical attention,
Output=True
else
Output=False
I am reading https://www.nltk.org/. I scraped https://www.merriam-webster.com/browse/medical/a to get medical-related words, but I cannot figure out how to detect sentences that are related to a medical term. I haven't written any code yet because the algorithm is not clear to me.
I want to know what I should use and where to start, and I need a tutorial link for implementing this. Any guidance will be highly appreciated.
I will list the various ways you can do this, from naive to intelligent:
- Get a large vocabulary of medical terms, iterate over the sentence, and return yes or no depending on whether you find anything.
- Get a large vocabulary of medical terms, iterate over the sentence, and do a fuzzy match with each word, so that words that are spelling variations of the same word are still detected. [Check the fuzzywuzzy library in Python.]
- Get a large vocabulary of medical terms with a definition for each. Use pre-trained word embeddings (word2vec, GloVe, etc.) for each word in the descriptions of those terms. Take a weighted sum of the word embeddings, with the weights set to the TF-IDF of each word, to represent each medical term (its description, to be precise) as a vector. Repeat the process for the sentence. Then take the cosine similarity between the two to measure how contextually similar the text is to the description of the medical term. If the similarity is above a threshold that you fix, return True. [This approach doesn't need the exact term; even if the person is only talking about the condition, it should be able to detect it.]
- Label a large number of sentences with the medical terms they contain (annotate using something like the API.AI entity annotation tool or the RASA entity annotation tool). Create a neural network with an input embedding layer (which you can initialise with word2vec embeddings if you like), bi-LSTM layers, and a softmax output over the list of medical terms / conditions. This will give you the probability of each condition or term being associated with the sentence.
- Create a neural network with an encoder-decoder architecture and an attention layer between the two. Create encoder embeddings from the input sentence. Create a decoder whose output is a string of medical terms. Train the encoder-decoder with attention on pre-annotated data.
- Create a pointer network that takes as input a sentence with its respective medical terms and returns pointers that point back to the inputs and mark them as medical or non-medical terms (not easy to build, FYI).
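The second and third approaches in this list can be sketched with the standard library alone. The sketch below uses `difflib` as a stand-in for fuzzywuzzy, and tiny made-up 3-dimensional vectors and IDF weights as a stand-in for real word2vec/GloVe embeddings and corpus statistics. Every name and number in it (`VOCAB`, `EMB`, `IDF`, the thresholds) is an assumption chosen for illustration.

```python
import difflib
import math

# Hypothetical vocabulary; a real one would hold thousands of terms.
VOCAB = ["abdomen", "fever", "migraine"]

def fuzzy_is_medical(sentence, cutoff=0.8):
    """Fuzzy lookup: True if any word is a close spelling match for a vocabulary term."""
    for word in sentence.lower().split():
        word = word.strip(".,;:!?\"'")
        if difflib.get_close_matches(word, VOCAB, n=1, cutoff=cutoff):
            return True
    return False

# Made-up embeddings and IDF weights, NOT real word2vec values.
EMB = {
    "stomach": [0.9, 0.1, 0.0], "belly": [0.8, 0.2, 0.0],
    "pain": [0.7, 0.0, 0.3], "hurts": [0.7, 0.1, 0.2],
    "car": [0.0, 0.1, 0.9], "engine": [0.1, 0.0, 0.9],
}
IDF = {"stomach": 1.5, "belly": 1.5, "pain": 1.0, "hurts": 1.0, "car": 1.2, "engine": 1.2}

def text_vector(text):
    """IDF-weighted sum of word vectors. A repeated word is added once per
    occurrence, so its term frequency is included implicitly."""
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        if word in EMB:  # words without an embedding are skipped
            for k, value in enumerate(EMB[word]):
                vec[k] += IDF.get(word, 1.0) * value
    return vec

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def is_related(sentence, description, threshold=0.8):
    """Embedding check: compare a sentence with a term description against a threshold."""
    return cosine(text_vector(sentence), text_vector(description)) >= threshold
```

With these toy values, `fuzzy_is_medical("My abdomin hurts")` catches the misspelling, and `is_related("my stomach hurts", "belly pain")` is true even though the two texts share no word. Both the fuzzy cutoff and the similarity threshold would need tuning on real data.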
OK, so which part do you not understand? The idea is rather simple, and one Google search gives great and easy results, unless the issue is that you don't know Python. In that case it will be very hard for you to implement this.
The idea itself is simple: tokenize the sentence (put each word in a list) and look each word up in the list of medical terms. If the current word is in the list, it is a medical term, so the sentence is related to that medical term as well. If you imagine that you have your medical terms in a medical_terms collection, then in Python it would look something like this:
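>>> import nltk
>>> # Placeholder vocabulary; replace it with your scraped list of terms.
>>> medical_terms = {"abdomen", "fever", "fracture"}
>>> sentence = """At eight o'clock on Thursday morning
... Arthur's abdomen was hurting."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', "'s", 'abdomen', 'was', 'hurting', '.']
>>> def is_medical(tokens):
...     # Return True as soon as any token is a known medical term.
...     for token in tokens:
...         if token.lower() in medical_terms:
...             return True
...     # Report False only after every token has been checked.
...     return False
...
>>> is_medical(tokens)
True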
You just tokenize the input sentence with NLTK and then check whether any of the words in the sentence are in the list of medical terms. You can adapt this function to work with n-grams as well. There are a lot of other approaches and special cases that have to be handled, but this is a good start.
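One possible way to adapt the lookup to n-grams, so that multi-word terms are found too, is sketched below. The vocabulary is a made-up placeholder, and a simple regular expression stands in for `nltk.word_tokenize` to keep the example self-contained.

```python
import re

# Hypothetical vocabulary containing single-word and multi-word terms.
MEDICAL_TERMS = {"abdomen", "blood pressure", "heart attack"}

def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def is_medical(sentence, max_n=3):
    """True if any 1..max_n word sequence in the sentence is a known term."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    for n in range(1, max_n + 1):
        if any(gram in MEDICAL_TERMS for gram in ngrams(tokens, n)):
            return True
    return False

print(is_medical("His blood pressure was high."))
print(is_medical("The pressure to finish was high."))
```

Keeping the terms in a set makes each lookup constant time, so the cost grows with the sentence length and `max_n` rather than with the size of the vocabulary.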
The following is my code, where I take user input.
import en_core_web_sm
nlp = en_core_web_sm.load()
text = input("please enter your text or words here")
doc = nlp(text)
print([t.text for t in doc])
If the user enters the text Deep Learning, it is broken into
('Deep', 'Learning')
How can I add a whitespace exception in nlp, so that the output looks like the following?
(Deep Learning)
The raw text from the user input is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
So if your user types in: Looking for Deep Learning experts
It will be tokenized as: ('Looking', 'for', 'Deep', 'Learning', 'experts')
spaCy does not know that Deep Learning is an entity on its own. If you want spaCy to recognize Deep Learning as a single entity, you need to teach it. If you have a predefined list of phrases that you want spaCy to recognize as single entities, you can use PhraseMatcher to do that.
You can check the details on how to use PhraseMatcher here
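A minimal sketch of the PhraseMatcher approach, assuming spaCy v3: the phrase list and the "FIELD" label are made up for illustration, and `spacy.blank("en")` is used so that only the tokenizer is needed. Matched phrases are merged into single tokens with the retokenizer.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.blank("en")  # tokenizer only; no trained model is required here

# Hypothetical phrase list; attr="LOWER" makes the match case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("FIELD", [nlp.make_doc(p) for p in ["deep learning", "machine learning"]])

def tokenize_with_phrases(text):
    """Tokenize text, then merge every matched phrase into one token."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    with doc.retokenize() as retokenizer:
        for span in filter_spans(spans):  # drop overlapping matches
            retokenizer.merge(span)
    return [token.text for token in doc]

print(tokenize_with_phrases("Looking for Deep Learning experts"))
```

The same function can be applied to the result of `input(...)` from the question. Phrases that are not in the list are still split on whitespace as usual.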
UPDATE - Reply to OP's comment below
I do not think there is a way spaCy can know about the entity you are looking for without being trained in the context of your domain or being provided a predefined subset of the entities.
The only solution I can think of is to use an annotation tool to teach spaCy
- Take a subset of your user inputs and annotate them manually (you can use Prodigy, by the makers of spaCy, or Brat, which is free)
- Use the annotations to train a new or existing NER model. Details on training a model can be found here.
Given a text like "Looking for Deep Learning experts", you would annotate "Deep Learning" with a label such as "FIELD". Then train a new entity type, 'FIELD'.
Once you have trained the model in the context, spaCy will learn to detect entities of interest.
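To give an idea of what such an annotation looks like, the sketch below builds one training example with character offsets. The `(text, {"entities": [(start, end, label)]})` layout follows the offset style used in spaCy's training examples, but treat the exact format as an assumption and check the training documentation for your spaCy version. The helper `annotate` is hypothetical.

```python
def annotate(text, phrase, label):
    """Build one offset-style training example for the first occurrence of phrase."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} does not occur in {text!r}")
    end = start + len(phrase)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})

example = annotate("Looking for Deep Learning experts", "Deep Learning", "FIELD")
print(example)
```

An annotation tool produces the same kind of offsets for you; a helper like this is only practical when the phrases are already known.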
I'm trying to tokenize sentences using spaCy.
The text includes lots of abbreviations and comments which end with a period. Also, the text was obtained with OCR, and sometimes there are line breaks in the middle of sentences. spaCy doesn't seem to perform well in these situations.
I have extracted some examples of how I want these sentences to be split. Is there any way to train spaCy's sentence tokenizer?
spaCy is a little unusual in that the default sentence segmentation comes from the dependency parser, so you can't train a sentence boundary detector directly as such, but you can add your own custom component to the pipeline or pre-insert some boundaries that the parser will respect. See their documentation with examples: Spacy Sentence Segmentation
For the cases you're describing, it would potentially also be useful to be able to specify that a particular position is NOT a sentence boundary, but as far as I can tell that's not currently possible.
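A custom component of the kind mentioned above might look roughly like this, assuming spaCy v3. The component name `ellipsis_boundaries` and the rule itself (start a new sentence after every "...") are made up for illustration; a real rule would encode the boundaries from your own examples.

```python
import spacy
from spacy.language import Language

@Language.component("ellipsis_boundaries")  # hypothetical component name
def ellipsis_boundaries(doc):
    # Mark the token after every "..." as the start of a new sentence.
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")  # blank pipeline, so this component alone sets the boundaries
nlp.add_pipe("ellipsis_boundaries")
# With a trained pipeline, add it ahead of the parser instead:
# nlp.add_pipe("ellipsis_boundaries", before="parser")

doc = nlp("first part ... second part ... third part")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

When the component runs before the parser, the parser keeps the boundaries that were set in advance and decides only the remaining ones.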
I want to use NER (the CRF classifier) to identify Author names in a query. I trained NER following the method given on the nlp.stanford.edu site, using the training file training-data.col, and tested it using the file testing-data.tsv.
The NER is tagging every input as Author, even the data that is tagged as non-Author in the training data. Can anyone tell me why NER is tagging the non-Authors in the training data as Authors, and how to train NER to identify Authors (I have a list of Author names to train on)?
Any suggestions for reference material on NER other than the nlp.stanford.edu site will be helpful.
That's a very small piece of training data, so I'm not surprised that it made the wrong inferences. Since the only example it has seen of "Atal" is as Author, it's tagging "Atal" as such.
But more so, if you want to discriminate between people listed at the beginning as Author and people listed in the text as 0, Stanford NER is not going to do that. Stanford NER is intended to make long distance inferences about the named-entity tags of tokens in natural language text. In other words, it's doing the opposite of what you're trying to do.
You could probably do this with some simple pattern recognition---if your documents are formatted in a similar way, with the authors together, I would start with exploiting that. You could use the NER to tag the authors as PERSON, and then use that tag as a feature in your own tagging.
I am using Stanford NER classification as part of a PHI de-identification process running on laboratory text notes. I am noticing that in some cases the classifier's tags (e.g. <PERSON></PERSON>) find a person's name but then continue to tag much more text on either side of it. This loss of precision means that we could potentially lose a lot of non-PHI and valuable info. Is there a way to prepare text so that entities are discovered more precisely?