Spacy - Tokenize quoted string - python-3.x

I am using spaCy 2.0 with a quoted string as input.
Example string:
"The quoted text 'AA XX' should be tokenized"
and I expect to extract:
[The, quoted, text, 'AA XX', should, be, tokenized]
However, I get some strange results while experimenting: the noun chunks and ents lose one of the quotes.
import spacy
nlp = spacy.load('en')
s = "The quoted text 'AA XX' should be tokenized"
doc = nlp(s)
print([t for t in doc])
print([t for t in doc.noun_chunks])
print([t for t in doc.ents])
Result
[The, quoted, text, ', AA, XX, ', should, be, tokenized]
[The quoted text 'AA XX]
[AA XX']
What is the best way to get the tokenization I need?

While you could modify the tokenizer and add your own custom prefix, suffix and infix rules that exclude quotes, I'm not sure this is the best solution here.
For your use case, it might make more sense to add a component to your pipeline that merges (certain) quoted strings into one token before the tagger, parser and entity recognizer are called. To accomplish this, you can use the rule-based Matcher and find combinations of tokens surrounded by '. The following pattern looks for one or more alphanumeric characters:
pattern = [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}]
Here's a visual example of the pattern in the interactive matcher demo. To do the merging, you can then set up the Matcher, add the pattern and write a function that takes a Doc object, extracts the matched spans and merges them into one token by calling their .merge method.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matcher.add('QUOTED', None, [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:  # merge into one token after collecting all matches
        span.merge()
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer

doc = nlp("The quoted text 'AA XX' should be tokenized")
print([token.text for token in doc])
# ['The', 'quoted', 'text', "'AA XX'", 'should', 'be', 'tokenized']
For a more elegant solution, you can also refactor the component as a reusable class that sets up the matcher in its __init__ method (see the docs for examples).
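For illustration, here is a minimal sketch of such a class-based component (using the same spaCy v2 API as the code above; the class name QuoteMerger and the component name are just placeholders):

import spacy
from spacy.matcher import Matcher

class QuoteMerger(object):
    """Pipeline component that merges 'single-quoted' spans into one token."""
    name = 'quote_merger'  # shows up in nlp.pipe_names

    def __init__(self, nlp):
        # set up the matcher once, using the shared vocab
        self.matcher = Matcher(nlp.vocab)
        self.matcher.add('QUOTED', None,
                         [{'ORTH': "'"}, {'IS_ALPHA': True, 'OP': '+'}, {'ORTH': "'"}])

    def __call__(self, doc):
        # collect all matched spans first, then merge them into single tokens
        spans = [doc[start:end] for match_id, start, end in self.matcher(doc)]
        for span in spans:
            span.merge()
        return doc

nlp = spacy.load('en')
nlp.add_pipe(QuoteMerger(nlp), first=True)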
If you add the component first in the pipeline, all other components like the tagger, parser and entity recognizer will only get to see the retokenized Doc. That's also why you might want to write more specific patterns that only merge certain quoted strings you care about. In your example, the new token boundaries improve the predictions – but I can also think of many other cases where they don't, especially if the quoted string is longer and contains a significant part of the sentence.

Related

Spacy Extracting mentions with Matcher

I am trying to match a sentence inside a bigger sentence using the spaCy rule-based Matcher, but the output is empty.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}]
matcher = Matcher(nlp.vocab)
matcher.add("mentions", [pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
The idea of the rule is to match the "CafeA is very generous with the portions." part, but I do not get any results. What is the correct way to do this in spaCy?
Any help will be appreciated.
Your code produces 2 6 CafeA is very generous when I run it on my machine (Windows 11 64-bit, Python 3.10.9, spaCy 3.4.4) using both the en_core_web_md and en_core_web_trf pipelines. As a side note, "CafeA" is not tagged as an organisation when using en_core_web_sm, and therefore the pattern does not match.
If you want to include "with the portions", you'll need to expand the pattern with the appropriate PoS tags (i.e. ADP (adposition), DET (determiner) and NOUN (noun), respectively). For example:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
doc1 = nlp("DUMMY TEXT CafeA is very generous with the portions. DUMMY TEXT DUMMY TEXT")
pattern = [{"ENT_TYPE": "ORG"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}, {"POS": "ADP"}, {"POS": "DET"}, {"POS": "NOUN"}]

for tok in doc1:
    print(tok.text, tok.pos_)
    print(tok.ent_type_)

matcher = Matcher(nlp.vocab)
matcher.add("mentions", [pattern])
for mid, start, end in matcher(doc1):
    print(start, end, doc1[start:end])
This matches and prints the following:
2 9 CafeA is very generous with the portions
If you're getting unexpected results with the matcher, try restarting your IDE or clearing any cached/stored variables.

how to extract a PERSON named entity after certain word with spacy?

I have this text (text2 in the code below); it contains the word 'by' three times. I want to use spaCy to extract the person's name that follows it (the full name, even if it is three words long; some cultures use long names, in this case two). The code is below, but my pattern raises an error. My intention: first anchor on the word 'by' with ORTH, then tell the program that whatever comes next is the entity PERSON. I would be happy if anyone could help:
import spacy
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'

def extract_person(nlp_doc):
    pattern = [{'ORTH': 'by'}, {'POS': 'NOUN'}}]
    # second possible pattern:
    # pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
    matcher.add('person_only', None, pattern)
    matches = matcher(nlp_doc)
    for match_id, start, end in matches:
        span = nlp_doc[start:end]
        return span.text

target_doc = nlp(text2)
extract_person(target_doc)
I think this question can be asked the other way around: how do you use NER tags in a Matcher pattern in spaCy?
If you want to match whole names, you should merge the entities first. You can do that by calling: nlp.add_pipe("merge_entities", after="ner")
Then in your pattern instead of:
pattern = [{"TEXT": "by"}, {"NER": "PERSON"}]
Use:
pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
Complete code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # any English pipeline with an NER component
nlp.add_pipe("merge_entities", after="ner")

text2 = 'All is done by Emily Muller, the leaf is burned by fire. we were not happy, so we cut relations by saying bye bye'
doc = nlp(text2)

pattern = [{"TEXT": "by"}, {"ENT_TYPE": "PERSON"}]
matcher = Matcher(nlp.vocab)
matcher.add('person_only', [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end])

Spacy, matcher with entities spanning more than a single token

I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to capture entities that span more than one token.
As an example, let's say that my custom entities are animals (and are labeled as token.ent_type_ = "animal")
["cat", "dog", "artic fox"] (note that the last entity has two words).
Now I want to find those entities in the text but negated, so I can create a simple matcher with the following pattern:
[{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
And for example, I have the following text:
There is no cat in the house and no artic fox in the basement
I can successfully capture no cat and no artic, but the last match is incorrect as the full match should be no artic fox. This is due to the OP: '+' in the pattern that matches a single custom entity instead of two. How can I modify the pattern to prioritize longer matches over shorter ones?
A solution is to use the doc retokenize method in order to merge the individual tokens of each multi-token entity into a single token:
import spacy
from spacy.pipeline import EntityRuler
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

animal = ["cat", "dog", "artic fox"]
ruler = EntityRuler(nlp)
for a in animal:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
nlp.add_pipe(ruler)

doc = nlp("There is no cat in the house and no artic fox in the basement")

# merge each multi-token entity into a single token
with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        retokenizer.merge(doc[ent.start:ent.end])

matcher = Matcher(nlp.vocab)
pattern = [{'lower': 'no'}, {'ENT_TYPE': {'REGEX': 'animal'}, 'OP': '+'}]
matcher.add('negated animal', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(span)
the output is now:
no cat
no artic fox
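As a side note, spaCy also ships a built-in merge_entities pipeline component (the same one used in the previous answer) that performs this merging automatically once the entities have been set, so you don't have to run the retokenizer yourself. A rough sketch along the lines of the code above, assuming the spaCy v2.x API (in v3 you would write nlp.add_pipe("merge_entities")):

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load('en_core_web_sm')
ruler = EntityRuler(nlp)
ruler.add_patterns([{"label": "animal", "pattern": a} for a in ["cat", "dog", "artic fox"]])
nlp.add_pipe(ruler)
nlp.add_pipe(nlp.create_pipe("merge_entities"))  # merges every entity span into one token

doc = nlp("There is no cat in the house and no artic fox in the basement")
print([(t.text, t.ent_type_) for t in doc])  # 'artic fox' should now be a single token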

Is there a way to turn off specific built-in tokenization rules in Spacy?

Spacy automatically tokenizes word contractions such as "dont" and "don't" into "do" and "nt"/"n't". For instance, a sentence like "I dont understand" would be tokenized into: ["I", "do", "nt", "understand"].
I understand this is usually helpful in many NLP tasks, but is there a way to suppress this special tokenization rule in Spacy such that the result is ["I", "dont", "understand"] instead?
This is because I am trying to evaluate the performance (F1-score for the BIO tagging scheme) of my custom spaCy NER model, and the mismatch between the number of tokens in the input sentence and the number of predicted token tags causes problems for my evaluation code down the line:
Input (3 tokens): [("I", "O"), ("dont", "O"), ("understand", "O")]
Predicted (4 tokens): [("I", "O"), ("do", "O"), ("nt", "O"), ("understand", "O")]
Of course, if anyone has any suggestions for a better way to perform evaluation on sequential tagging tasks in Spacy (perhaps like the seqeval package but more compatible with Spacy's token format), that would be greatly appreciated as well.
The special-case tokenization rules are defined in the tokenizer_exceptions.py in the respective language data (see here for the English "nt" contractions). When you create a new Tokenizer, those special case rules can be passed in via the rules argument.
Approach 1: Custom tokenizer with different special case rules
So one thing you could do for your use case is to reconstruct the English Tokenizer with the same prefix, suffix and infix rules, but with only a filtered set of tokenizer exceptions. Tokenizer exceptions are keyed by the string, so you could remove the entries for "dont" and whatever else you need. However, the code is quite verbose, since you're reconstructing the whole tokenizer:
from spacy.lang.en import English
from spacy.lang.en import TOKENIZER_EXCEPTIONS
from spacy.lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_re = compile_infix_regex(TOKENIZER_INFIXES).finditer

filtered_exc = {key: value for key, value in TOKENIZER_EXCEPTIONS.items() if key not in ["dont"]}

nlp = English()
tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re,
    suffix_search=suffix_re,
    infix_finditer=infix_re,
    rules=filtered_exc
)
nlp.tokenizer = tokenizer
doc = nlp("I dont understand")
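As a quick sanity check (assuming the "dont" exception has been removed as shown), the custom tokenizer should now keep the contraction as one token:

print([token.text for token in doc])
# expected: ['I', 'dont', 'understand']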
Approach 2: Merge (or split) tokens afterwards
An alternative approach would be to keep the tokenization as it is, but add rules on top that merge certain tokens back together afterwards to match the desired tokenization. This is obviously going to be slower at runtime, but it might be easier to implement and reason about, because you can approach it from the perspective of "Which tokens are currently separated but should be one?". For this, you could use the rule-based Matcher and the retokenizer to merge the matched tokens back together. As of spaCy v2.1, it also supports splitting, in case that's relevant.
from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "do"}, {"LOWER": "nt"}]]
matcher.add("TO_MERGE", None, *patterns)

doc = nlp("I dont understand")
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        span = doc[start:end]
        retokenizer.merge(span)
The above pattern would match two tokens (one dict per token), whose lowercase forms are "do" and "nt" (e.g. "DONT", "dont", "DoNt"). You can add more lists of dicts to the patterns to describe other sequences of tokens. For each match, you can then create a Span and merge it into one token. To make this logic more elegant, you could also wrap it as a custom pipeline component, so it's applied automatically when you call nlp on a text.

How to detokenize spacy text without doc context?

I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization, for both the encoder and the decoder.
The output of the seq2seq model is a stream of tokens, and I want to detokenize it to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spacy to reverse tokenization done by rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
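As an aside: as long as you still have the Doc object, the trailing-whitespace information is enough to reconstruct the original string exactly, e.g. via token.text_with_ws, and the two affinity bits can be read off with the helper functions above. A minimal sketch (the seq2seq setting in the question is hard precisely because only the token strings survive):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This doesn't work.")

# each token's text plus its trailing whitespace restores the exact input string
assert "".join(token.text_with_ws for token in doc) == doc.text

# per-token (text, pre-space affinity, post-space affinity), using the helpers above
print([(t.text, has_pre_space(t), bool(has_space(t))) for t in doc])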
TL;DR
I've written some code that attempts to do this; the snippet is below.
Another approach, with a computational complexity of O(n^2), would be to use a function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string

class detokenizer:
    """ This class is an attempt to detokenize spaCy tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the token count of the two parts tokenized separately
        equals the token count after joining them and tokenizing again.
        In other words, we join only if the join is reversible.
        e.g.:
        for the text ["The", "man", "."]
        we would join "man" with "."
        but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
With this approach you may end up merging "do" and "nt", or stripping the space between a dot "." and the preceding word, even when the original text actually had them written separately.
The method is not perfect, since multiple different input sentences can lead to the same spaCy tokenization.
I am not sure there is a way to fully detokenize a sentence when all you have is the spaCy-separated text, but this is the best I've got.
After searching Google for hours, only a few answers came up, with this very Stack Overflow question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers had chosen, I had to develop my own solution. Hope it helps someone ;)
