Spacy: nlp over tokens to annotate IOB file - python-3.x

I have a file that is annotated in IOB format. I appended each token of column one to a list of sentences, so that each sentence is one list of tokens. I then iterate over the sentences and, within each sentence, over its tokens. The code:
with open('/content/drive/MyDrive/Spacy/Test/annotated_tuebadz_spacy.tsv', 'w+', encoding='utf-8') as tsvfile:
    wrt = csv.writer(tsvfile, delimiter='\t')
    nlp = spacy.load("/content/drive/MyDrive/Spacy/model/model-best")
    for sent in sent_list:
        for token in sent:
            doc = nlp(token)
            if doc[0].ent_iob_ == "O":
                label = doc[0].ent_iob_ + doc[0].ent_type_
            else:
                label = doc[0].ent_iob_ + "-" + doc[0].ent_type_
            print(doc.text, label)
            wrt.writerow((doc.text, label))
where sent_list is the list of tokenized sentences, each sentence being a list of tokens, e.g. [["I", "am", "a", "robot", "."], ["How", "are", "you", "?"]]. I want to manually compare against the gold annotations of the original file, so I want to stick to its tokenization.
Now, my question: the results are much lower than the scores I get from spaCy's evaluate script (which does not annotate the file itself). Is the problem that this approach does not use any context information? How can I improve the script? Other strategies would also be appreciated!
Thanks!
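A hedged aside on the context question: one way to keep the gold tokenization while still letting the model see whole sentences is to build a pre-tokenized Doc and run the pipeline on it, rather than calling nlp on each token in isolation. A minimal sketch, assuming spaCy v3 (variable names reuse those from the question):

from spacy.tokens import Doc

# Build a Doc from the gold tokens so spaCy's own tokenizer is bypassed,
# then run the loaded pipeline on the whole sentence at once.
for sent in sent_list:
    doc = nlp(Doc(nlp.vocab, words=sent))
    for token in doc:
        if token.ent_iob_ == "O":
            label = token.ent_iob_
        else:
            label = token.ent_iob_ + "-" + token.ent_type_
        wrt.writerow((token.text, label))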

Related

Transformers get named entity prediction for words instead of tokens

This is a very basic question, but I spent hours struggling to find the answer. I built an NER model using Hugging Face transformers.
Say I have the input sentence:
input = "Damien Hirst oil in canvas"
I tokenize it to get
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
tokenized = tokenizer.encode(input) #[101, 12587, 7632, 12096, 3514, 1999, 10683, 102]
I feed the tokenized sentence to the model to get predicted tags for the tokens:
['B-ARTIST' 'B-ARTIST' 'I-ARTIST' 'I-ARTIST' 'B-MEDIUM' 'I-MEDIUM'
'I-MEDIUM' 'B-ARTIST']
The prediction comes as output from the model and assigns a tag to each token.
How can I recombine this data to obtain tags for words instead of tokens? So I would know that
"Damien Hirst" = ARTIST
"Oil in canvas" = MEDIUM
There are two questions here.
Annotating Token Classification
A common sequence-tagging scheme, especially in Named Entity Recognition, labels the first token of a span with tag X as B-X and the remaining tokens of that span as I-X.
The problem is that most annotated datasets are tokenized by whitespace! For example:
[CLS] O
Damien B-ARTIST
Hirst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
[SEP] O
where O indicates that it is not a named-entity, B-ARTIST is the beginning of the sequence of tokens labelled as ARTIST and I-ARTIST is inside the sequence - similar pattern for MEDIUM.
At the time I posted this answer, there was an example of NER in the Hugging Face documentation here:
https://huggingface.co/transformers/usage.html#named-entity-recognition
The example doesn't exactly answer the question here, but it can add some clarification. A similar style of named entity labels for this example could be as follows:
label_list = [
    "O",         # not a named entity
    "B-ARTIST",  # beginning of an artist name
    "I-ARTIST",  # an artist name
    "B-MEDIUM",  # beginning of a medium name
    "I-MEDIUM",  # a medium name
]
Adapt Tokenizations
With all that said about the annotation scheme: BERT and several other models use a different tokenization model, so we have to reconcile the two tokenizations.
In this case with bert-base-uncased, the expected outcome is like this:
damien B-ARTIST
hi I-ARTIST
##rst I-ARTIST
oil B-MEDIUM
in I-MEDIUM
canvas I-MEDIUM
In order to get this done, you can go through each token in the original annotation, tokenize it, and repeat its label for every resulting wordpiece:
tokens_old = ['Damien', 'Hirst', 'oil', 'in', 'canvas']
labels_old = ["B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]
label2id = {label: idx for idx, label in enumerate(label_list)}
tokens, labels = zip(*[
    (token, label)
    for token_old, label in zip(tokens_old, labels_old)
    for token in tokenizer.tokenize(token_old)
])
When you add [CLS] and [SEP] to the tokens, their label "O" must also be added to labels.
With the code above, it is possible to get into a situation where a beginning tag like B-ARTIST gets repeated when the beginning word splits into pieces. According to the Hugging Face documentation, you can encode these labels as -100 so they are ignored:
https://huggingface.co/transformers/custom_datasets.html#token-classification-with-w-nut-emerging-entities
Something like this should work:
tokens, labels = zip(*[
    (token, label2id[label] if (label[:2] != "B-" or i == 0) else -100)
    for token_old, label in zip(tokens_old, labels_old)
    for i, token in enumerate(tokenizer.tokenize(token_old))
])
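Going the other way (from wordpiece-level predictions back to word-level labels, which is what the question ultimately asks for) is not covered above. A minimal sketch, assuming a fast tokenizer so that word_ids() is available; the tag values are the illustrative ones from the question:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast tokenizer by default

words = ["Damien", "Hirst", "oil", "in", "canvas"]
# one predicted tag per wordpiece, including [CLS] and [SEP]
piece_tags = ["O", "B-ARTIST", "I-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM", "O"]

encoding = tokenizer(words, is_split_into_words=True)

word_tags = {}
for piece_idx, word_idx in enumerate(encoding.word_ids()):
    # keep the tag of the first wordpiece of each word; special tokens map to None
    if word_idx is not None and word_idx not in word_tags:
        word_tags[word_idx] = piece_tags[piece_idx]

print([(word, word_tags[i]) for i, word in enumerate(words)])
# e.g. [('Damien', 'B-ARTIST'), ('Hirst', 'I-ARTIST'), ('oil', 'B-MEDIUM'), ...]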

NLP : Error/Unknown/misspelled text Detection model of a patient's medical text file

I have a few patients' medical record text files which I got from the internet, and I want to identify which files are of bad quality (misspelled words, special characters between words, erroneous words) and which are of good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus for medical records to identify misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First tokenize your text (I recommend scispacy for medical text)
Identify possible "bad quality" words simply by counting each unique word in your corpus, e.g. all words that occur <= 3 times
Add words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms. Otherwise use a medical dictionary e.g. UMLS, or https://github.com/glutanimate/wordlist-medicalterms-en to add the medical words not in a regular dictionary
Use pyspellchecker to identify the misspellings by using the Levenshtein Distance algorithm and comparing against our dictionary.
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker

nlp = spacy.load('en_core_sci_md')  # sciSpaCy

word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())

infreq_words = [word for word in word_freq.keys() if word_freq[word] <= 3 and word[0].isdigit() == False]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]

add_to_dictionary = " ".join(freq_words)
f = open("medical_dict.txt", "w+")
f.write(add_to_dictionary)
f.close()

spell = SpellChecker()
spell.distance = 1  # set the distance parameter to just 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')

misspelled = spell.unknown(infreq_words)

misspell_dict = {}
for i, word in enumerate(misspelled):
    if (word != spell.correction(word)):
        misspell_dict[word] = spell.correction(word)
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
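For instance, a minimal sketch of that kind of systematic regex clean-up (the patterns are only illustrations, not a recommendation for any particular corpus):

import re

def clean_text(text):
    # Illustrative patterns only; adapt them to whatever systematic noise your files contain.
    text = re.sub(r"(?<=\w)[^\w\s](?=\w)", "", text)   # drop stray symbols inside words: "pat#ent" -> "patent"
    text = re.sub(r"(\w)\1{3,}", r"\1", text)          # collapse long character runs: "feverrrr" -> "fever"
    text = re.sub(r"\s{2,}", " ", text)                # normalise repeated whitespace
    return text.strip()

print(clean_text("pat#ent  presented with   feverrrr"))  # -> "patent presented with fever"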
You can use BioBERT to do contextual spell checking.
Link: https://github.com/dmis-lab/biobert

How to detokenize spacy text without doc context?

I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization, on both the encoder and decoder side.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spacy to reverse tokenization done by rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
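For reference, a minimal sketch of both ideas: rebuilding a string from a spaCy Doc via the trailing-whitespace flags, and a revtok-style join rule applied to (token, pre-space, post-space) triples. The triple format and helper names are just illustrations:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This doesn't work.")

# Rebuild the original string from each token plus its trailing-whitespace flag.
restored = "".join(token.text + token.whitespace_ for token in doc)
assert restored == doc.text

# revtok-style rule applied to predicted bits: insert a space only when the left
# token "wants" a following space AND the right token "wants" a preceding space.
def join_tokens(triples):
    """triples: (text, wants_pre_space, wants_post_space) -- an illustrative format."""
    out = ""
    prev_wants_post = False
    for i, (text, wants_pre, wants_post) in enumerate(triples):
        if i > 0 and prev_wants_post and wants_pre:
            out += " "
        out += text
        prev_wants_post = wants_post
    return out

print(join_tokens([("hello", True, False), (".", False, True)]))    # hello.
print(join_tokens([("hello", True, True), ("world", True, True)]))  # hello world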
TL;DR
I've written code that attempts to do this; the snippet is below.
Another approach, with a computational complexity of O(n^2), would be to use the class I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string


class detokenizer:
    """ This class is an attempt to detokenize spaCy tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens : list):
        """ Call this method to get list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens : list) -> str:
        """ Call this method to get detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens : list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of spaCy tokenized words is equal to the length of joined and then spaCy tokenized words...
        In other words, we say we should join only if the join is reversible.
        e.g.:
        for the text ["The", "man", "."]
        we would join "man" with "."
        but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
In this approach you may easily merge "do" and "nt" (losing the apostrophe of "don't"), and in some cases the space between a dot "." and the preceding word is not removed.
This method is not perfect, as there are multiple possible combinations of sentences that lead to specific spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After having searched for hours on Google, only a few answers came along, with this very Stack question being open on 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)

How should I strip these tweets of words like "the" and "I"?

I'm trying to clean up a bunch of tweets so that they can be used for k-means clustering. I've written the following code that should strip each tweet of its unwanted characters.
from nltk.corpus import stopwords
import nltk
import json

with open("/Users/titus/Desktop/trumptweets.json", 'r', encoding='utf8') as f:
    data = json.loads(f.readline())

tweets = []
for sentence in data:
    tokens = nltk.wordpunct_tokenize(sentence['text'])
    type(tokens)
    text = nltk.Text(tokens)
    type(text)
    words = [w.lower() for w in text if w.isalpha() and w not in
             stopwords.words('english') and w is not 'the']
    s = " "
    useful_sentence = s.join(words)
    tweets.append(useful_sentence)

print(tweets)
I'm trying to remove words like "I" and "the", but for some reason I can't figure out how. If I look at the tweets after they've gone through the loop, the word "the" still occurs.
Question: How is it possible that there are still occurrences of "the" and "I" in the tweets? How should I fix this?
Beware of the processing order.
Here are two test strings for you:
THIS THE REMAINS.
this the is removed
Because "THE" is not "the". You lowercase after filtering, but you should first lowercase then filter.
The bad news for you: k-means works horribly bad on noisy short text like twitter. Because it is sensitive to noise, and the TFIDF vectors need very long texts to be reliable. So carefully verify your results, they probably are not as good as they may seem in the first enthusiasm.
Have you tried lowercasing w in the check?
words = [w.lower() for w in text if w.isalpha() and w.lower() not in
         stopwords.words('english') and w.lower() is not 'the']
is (and is not) is the (reference) identity check. It compares whether two variable names point to the same object in memory. Typically this is only used to compare with None, or for some other special cases.
In your case, use the != operator or the negation of == to compare with the string "the".
See also: Is there a difference between `==` and `is` in Python?
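A tiny illustration of the difference (the result of is on strings is a CPython implementation detail):

a = "the"
b = "".join(["t", "h", "e"])   # same value, built at runtime

print(a == b)   # True: the values are equal
print(a is b)   # typically False: two distinct string objects
print(a != b)   # False: use == / != for value checks such as stopword filtering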

Word2vec: add external word to every context

I'm looking for a simple "hack" to implement the following idea: I want to have a specific word appear artificially in the context of every word (the underlying goal is to try and use word2vec for supervised sentence classification).
An example is best:
Say I have the sentence: "The dog is in the garden", and a window of 1.
So we would get the following pairs of (target, context):
(dog, The), (dog, is), (is, dog), (is, in), etc.
But what I would like to feed to the word2vec algo is this:
(dog, The), (dog, is), **(dog, W)**, (is, dog), (is, in), **(is, W)**, etc.,
as if my word W was in the context of every word, where W is a word of my choosing that is not in the existing vocabulary.
Is there an easy way to do this in R or Python?
I imagine you have a list of sentences and a list of labels, one per sentence:
sentences = [
    ["The", "dog", "is", "in", "the", "garden"],
    ["The", "dog", "is", "not", "in", "the", "garden"],
]
Then you create the word-context pairs:
word_context = [("dog", "The"), ("dog", "is"), ("is", "dog"), ("is", "in") ...]
Now, if you have a label for each sentence, you can add the label to the context of every word:
labels = [
    "W1",
    "W2",
]

word_labels = [
    (word, label)
    for sent, label in zip(sentences, labels)
    for word in sent
]

word_context += word_labels
Unless you want to keep the order in word-context pairs!
Take a look at the 'Paragraph Vectors' algorithm – implemented as the class Doc2Vec in Python gensim. In it, each text example gets an extra pseudoword that essentially floats over the full example, contributing itself to every skip-gram-like (called PV-DBOW in Paragraph Vectors) or CBOW-like (called PV-DM in Paragraph Vectors) training-context.
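A minimal gensim sketch of that idea, reusing the sentences and labels from the example above (the parameter values are placeholders, and model.dv assumes gensim 4.x):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each sentence gets its class label as a document tag; during PV-DBOW/PV-DM training
# that tag behaves like an extra pseudo-word shared by every context in the sentence.
documents = [
    TaggedDocument(words=sent, tags=[label])
    for sent, label in zip(sentences, labels)
]

model = Doc2Vec(documents, vector_size=100, window=1, min_count=1,
                dm=0, dbow_words=1, epochs=40)

print(model.dv["W1"])   # vector learned for the label / pseudo-word
print(model.wv["dog"])  # ordinary word vector trained alongside (dbow_words=1)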
Also take a look at Facebook's 'FastText' paper and library. It's essentially an extension of word2vec in two different directions:
First, it has the option of learning vectors for subword fragments (character n-grams) so that future unknown words can get rough-guess vectors from their subwords.
Second, it has the option of trying to predict not just nearby-words during vector-training, but also known-classification-labels for the containing text example (sentence). As a result, the learned word-vectors may be better for subsequent classification of other future sentences.
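If you want to try that supervised mode, a minimal sketch with the fasttext Python bindings (the training-file name and labels are made up):

import fasttext

# fastText's supervised mode expects one example per line, with labels prefixed
# by "__label__", e.g.:
#   __label__W1 The dog is in the garden
#   __label__W2 The dog is not in the garden
model = fasttext.train_supervised(input="sentences.txt", epoch=25, wordNgrams=2)

print(model.predict("The dog is in the garden"))   # predicted label(s) + probabilities
print(model.get_word_vector("dog"))                # word vectors learned alongside the classifier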
