Custom NER Spacy throwing IndexError: list index out of range - python-3.x

I am trying to create a custom NER model using spaCy, but while training I am getting the following error:
gold = GoldParse(doc, entities=entity_offsets)
File "gold.pyx", line 565, in spacy.gold.GoldParse.__init__
IndexError: list index out of range
Any idea as to how I can fix this?
The most common resolution that came up after some Google searching was to trim leading and trailing whitespace from the entity spans in the training data, so I used the code below to trim them, but it was still of no use.
import re

invalid_span_tokens = re.compile(r'\s')
cleaned_data = []
for text, annotations in data:
    entities = annotations['entities']
    valid_entities = []
    for start, end, label in entities:
        valid_start = start
        valid_end = end
        # Move the start forward past any leading whitespace
        while valid_start < len(text) and invalid_span_tokens.match(text[valid_start]):
            valid_start += 1
        # Move the end back past any trailing whitespace
        while valid_end > 1 and invalid_span_tokens.match(text[valid_end - 1]):
            valid_end -= 1
        valid_entities.append([valid_start, valid_end, label])
    cleaned_data.append([text, {'entities': valid_entities}])

Ah, so the words would be a keyword argument on the GoldParse object. This lets you specify the gold-standard tokenization, if it doesn't match spaCy's tokenization. Assuming your input looks like this:
text = 'helloworld'
words = ['hello', 'world']
tags = ['INTJ', 'NOUN']
You can do the following:
doc = nlp.make_doc(text)
gold = GoldParse(doc, words=words, tags=tags)
nlp.update([doc], [gold])
Alternatively, you can also use the new "simple training style" and just pass in the text as a string, and the annotations as a dictionary:
nlp.update([text], [{'words': words, 'tags': tags}])
In general, we'd recommend using the simple style, as it removes one layer of abstraction, and lets you get rid of the Doc and GoldParse imports. But ultimately, the style you choose depends on your personal preference.
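To tie this back to the original entity-offset problem, here is a minimal sketch of a full custom-NER loop using the simple training style in spaCy v2, built on the cleaned_data produced by the trimming snippet above (the blank pipeline and the iteration count are illustrative assumptions, not part of the answer):
import random
import spacy

nlp = spacy.blank('en')  # assumption: start from a blank English pipeline
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, annotations in cleaned_data:
    for _, _, label in annotations['entities']:
        ner.add_label(label)

optimizer = nlp.begin_training()
for itn in range(10):  # the number of iterations here is arbitrary
    random.shuffle(cleaned_data)
    for text, annotations in cleaned_data:
        # Simple training style: raw text plus an annotations dict,
        # no explicit Doc or GoldParse objects needed
        nlp.update([text], [annotations], sgd=optimizer)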

Related

How to merge entities in spaCy via rules

I want to use some of the entities in spaCy 3's en_core_web_lg, but relabel some of what it tags as 'ORG' to 'ANALYTIC', since it treats the 3-character codes I want to use, such as 'P&L' and 'VaR', as organizations. The model has DATE entities, which I'm fine to preserve. I've read all the docs, and it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I have been through the training 2-3x now, read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, like needing a decorator, or other. Lord, is it really that hard?
My code:
from spacy.matcher import Matcher
from spacy.tokens import Span

analytics = [
    [{'LOWER': 'risk'}],
    [{'LOWER': 'pnl'}],
    [{'LOWER': 'p&l'}],
    [{'LOWER': 'return'}],
    [{'LOWER': 'returns'}]
]
matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")
    # Overwrite doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)
This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or create my own custom pipeline using rules. Below is the suggested construction code from the EntityRuler docs. No clue.
# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
So when you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can only be in at most one span. In your case you can fix this by modifying this line:
doc.ents = list(doc.ents) + [span]
Here you've included both the old span (that you don't want) and the new span. If you get doc.ents without the old span this will work.
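As a hedged sketch of that fix, adapting the loop from the question (spaCy also provides spacy.util.filter_spans as a general helper for resolving overlapping spans):
from spacy.tokens import Span

for match_id, start, end in matcher(doc):
    span = Span(doc, start, end, label="ANALYTIC")
    # Keep only the existing entities that do not share tokens with the new span
    kept = [ent for ent in doc.ents if ent.end <= span.start or ent.start >= span.end]
    doc.ents = kept + [span]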
There are also other ways to do this. Here I'll use a simplified example where you always want to change items of length 3, but you can modify this to use your list of specific words or something else.
You can directly modify entity labels, like this:
for ent in doc.ents:
    if len(ent.text) == 3:
        ent.label_ = "CHECK"
    print(ent.label_, ent, sep="\t")
If you want to use the EntityRuler it would look like this:
import spacy

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
patterns = [
    {"label": "ANALYTIC",
     "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]}]
ruler.add_patterns(patterns)

text = "P&L reported amazing returns this year."
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent, sep="\t")
One more thing - you don't say what version of spaCy you're using. I'm using spaCy v3 here. The way pipes are added changed a bit in v3.
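As a brief side-by-side illustration of that change (not taken from the answer above):
# spaCy v2: construct the component yourself, then add the instance
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
nlp.add_pipe(ruler)

# spaCy v3: add the component by its registered string name and configure it
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})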

How to add stop words to TfidfVectorizer?

I am trying to add stop words into my stop_word list, however, the code I am using doesn't seem to be working:
Creating stop words list:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords.extend(CustomListofWordstoExclude)
Here I am converting the text to a dtm (document term matrix) with tfidf weighting:
vect = TfidfVectorizer(stop_words = 'english', min_df=150, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
But when I do this, I get this error:
FutureWarning: Pass input=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
warnings.warn("Pass {} as keyword args. From version 0.25 "
What does this mean? Is there an easier way to add stopwords?
I'm unable to reproduce the warning. However, note that a warning such as this does not mean that your code did not run as intended. It means that in future releases of the package it may not work as intended. So if you try the same thing next year with updated packages, it may not work.
With respect to your question about using stop words, there are two changes that need to be made for your code to work as you expect.
list.extend() extends the list in-place, but it doesn't return the list. To see this you can do type(stopwords1) which gives NoneType. To define a new variable and add the custom words list to stopwords in one line, you could just use the built-in + operator functionality for lists:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords + CustomListofWordstoExclude
To actually use stopwords1 as your new stopwords list when performing the TF-IDF vectorization, you need to pass stop_words=stopwords1:
vect = TfidfVectorizer(stop_words=stopwords1,  # passed stopwords1 here
                       min_df=150,
                       token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
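If you want a quick check that the custom stop word actually took effect, a hedged sketch (assuming the df['tweets'] column from the question and a recent scikit-learn) would be:
print('rt' in vect.get_stop_words())         # True: 'rt' is in the effective stop word list
print('rt' in vect.get_feature_names_out())  # False: 'rt' did not survive as a feature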

Finding the Start and End char indices in Spacy

I am training a custom model in spaCy to extract custom entities. Since the training data I need to provide consists of my entities along with their character index locations, I wanted to understand if there's a faster way to assign those index values for the keywords I am looking for in a particular sentence of my training data.
An example of my training data:
TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]})
]
Now, to pass the index locations for specific keywords in my training data, I am presently counting them manually.
For example, in the first line, where I am saying "Behaviour Skills include Communication" etc., I am manually calculating the index location for the word "Communication", which is 25, 37.
I am sure there must be a way to identify these indices by some other method instead of counting them manually. Any ideas how I can achieve this?
Using str.find() can help here. However, you have to loop through both the sentences and the keywords:
keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
         'Some sentence where lower case conflict resolution is included']
LABEL = 'BS'

TRAIN_DATA = []
for text in texts:
    entities = []
    t_low = text.lower()
    for keyword in keywords:
        k_low = keyword.lower()
        begin = t_low.find(k_low)  # index if substring found and -1 otherwise
        if begin != -1:
            end = begin + len(keyword)
            entities.append((begin, end, LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))
Output:
[('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
{'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}),
('Some sentence where lower case conflict resolution is included',
{'entities': [(31, 50, 'BS')]})]
I added str.lower() just in case you might need it.
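One caveat: str.find() only returns the first occurrence of each keyword. If a keyword can appear more than once in a sentence, a variant using re.finditer() (an extension of the snippet above, not shown in the original answer) would capture every match:
import re

TRAIN_DATA = []
for text in texts:
    entities = []
    for keyword in keywords:
        # re.escape protects characters like '&' inside keywords
        for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
            entities.append((match.start(), match.end(), LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))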

NLP: Error/Unknown/misspelled text detection model for a patient's medical text files

I have a few patient medical record text files that I got from the internet, and I want to identify which files are of bad quality (misspelled words, special characters between words, erroneous words) and which are of good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus for medical records to identify the misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First, tokenize your text (I recommend scispaCy for medical text).
Identify possible "bad quality" words simply by counting each unique word across your corpus, e.g. all words that occur <= 3 times.
Add the words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms. Otherwise, use a medical dictionary, e.g. UMLS or https://github.com/glutanimate/wordlist-medicalterms-en, to add the medical words not in a regular dictionary.
Use pyspellchecker to identify the misspellings, using the Levenshtein distance algorithm and comparing against our dictionary.
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker

nlp = spacy.load('en_core_sci_md')  # sciSpaCy

word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())

infreq_words = [word for word in word_freq.keys()
                if word_freq[word] <= 3 and not word[0].isdigit()]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]

add_to_dictionary = " ".join(freq_words)
f = open("medical_dict.txt", "w+")
f.write(add_to_dictionary)
f.close()

spell = SpellChecker()
spell.distance = 1  # set the distance parameter to just 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')

misspelled = spell.unknown(infreq_words)

misspell_dict = {}
for i, word in enumerate(misspelled):
    if word != spell.correction(word):
        misspell_dict[word] = spell.correction(word)
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
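For example, a small hedged sketch of that regex idea (the exact pattern depends on what your files actually contain):
import re

def clean_word(word):
    # Drop characters that are neither letters, digits, hyphens nor apostrophes,
    # e.g. a stray '#' or '@' embedded inside a word
    return re.sub(r"[^A-Za-z0-9'\-]", "", word)

print(clean_word("pat#ient"))  # -> patient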
You can also use BioBERT for contextual spell checking.
Link: https://github.com/dmis-lab/biobert

Convert everything in a dictionary to lower case, then filter on it?

import pandas as pd
import nltk
import os

directory = os.listdir(r"C:\...")

x = []
num = 0
for i in directory:
    x.append(pd.read_fwf("C:\\..." + i))
    x[num] = x[num].to_string()
So, once I have a dictionary x = [ ] populated by the read_fwf for each file in my directory:
I want to know how to make it so every single character is lowercase. I am having trouble understanding the syntax and how it is applied to a dictionary.
I want to define a filter that I can use to count occurrences of a list of words in this newly defined dictionary, e.g.,
list = [bus, car, train, aeroplane, tram, ...]
Edit: Quick unrelated question:
Is pd.read_fwf the best way to read .txt files? If not, what else could I use?
Any help is very much appreciated. Thanks
Edit 2: Sample data and output that I want:
Sample:
The Horncastle boar's head is an early seventh-century Anglo-Saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. It was discovered in 2002 by a metal detectorist searching
in the town of Horncastle, Lincolnshire. It was reported as found
treasure and acquired for £15,000 by the City and County Museum, where
it is on permanent display.
Required output - changes everything in uppercase to lowercase:
the horncastle boar's head is an early seventh-century anglo-saxon
ornament depicting a boar that probably was once part of the crest of
a helmet. it was discovered in 2002 by a metal detectorist searching
in the town of horncastle, lincolnshire. it was reported as found
treasure and acquired for £15,000 by the city and county museum, where
it is on permanent display.
You shouldn't need to use pandas or dictionaries at all. Just use Python's built-in open() function:
# Open a file in read mode with a context manager
with open(r'C:\path\to\your\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Use the string's lower() method to make everything lowercase
text = text.lower()
print(text)

# Split the text by whitespace into a list of words
word_list = text.split()

# Get the number of elements in the list (the word count)
word_count = len(word_list)
print(word_count)
If you want, you can do it in the reverse order:
# Open a file in read mode with a context manager
with open(r'C:\path\to\your\file.txt', 'r') as file:
    # Read the file into a string
    text = file.read()

# Split the text by whitespace into a list of words
word_list = text.split()

# Use a list comprehension to create a new list with lower() applied to each word
lowercase_word_list = [word.lower() for word in word_list]
print(lowercase_word_list)
Using a context manager for this is good since it automatically closes the file for you as soon as it goes out of scope (de-indented from the with statement block). Otherwise you would have to call open() and then remember to call file.close() yourself.
I think there are some other benefits to using context managers, but someone please correct me if I'm wrong.
I think what you are looking for is dictionary comprehension:
# Python 3
new_dict = {key: val.lower() for key, val in old_dict.items()}
# Python 2
new_dict = {key: val.lower() for key, val in old_dict.iteritems()}
items()/iteritems() gives you a list of tuples of the (keys, values) represented in the dictionary (e.g. [('somekey', 'SomeValue'), ('somekey2', 'SomeValue2')])
The comprehension iterates over each of these pairs, creating a new dictionary in the process. In the key: val.lower() section, you can do whatever manipulation you want to create the new dictionary.
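For instance, with a small made-up dictionary:
old_dict = {'doc1': 'The Horncastle Boar', 'doc2': 'City and County Museum'}
new_dict = {key: val.lower() for key, val in old_dict.items()}
print(new_dict)
# {'doc1': 'the horncastle boar', 'doc2': 'city and county museum'}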
