I am learning NLP and was trying to replace spaCy's default sentence segmenter with my custom SentenceSegmenter. While doing so, I see that my custom code is not replacing spaCy's default.
Note: spaCy == 3.4.1
Below is my code:
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

@Language.component("component")
def changeSentenceSegmenter(doc):
    for token in doc:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe('component', before='parser')
nlp.pipe_names

mystring = nlp(u"This is a sentence. This is another.\n\nThis is a\nthird sentence.")
for sent in mystring.sents:
    print(sent)
The output for the above code is:
However, my desired output is:
By default, is_sent_start is None. Your component sets it to True for some tokens but does not modify it for the others. When the parser runs, it will assign a value to any token where it is still unset, and it may create new sentence boundaries that way. In this example it looks like that's what's happening.
If you want your component to be the only thing that sets sentence boundaries, set is_sent_start to True or False for every token.
Also note that there is an open bug related to this behaviour, so it's possible for the parser to overwrite settings when it shouldn't. It usually doesn't come up in practice: if you set a value for every token, or only set True for some tokens, you shouldn't hit it.
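For illustration, here is a minimal sketch of a component that assigns a value to every token, under the assumption that newlines should be the only sentence boundaries (the component name newline_segmenter is just an example):

import spacy
from spacy.language import Language

@Language.component("newline_segmenter")
def newline_segmenter(doc):
    # Set is_sent_start explicitly for every token so the parser cannot override it
    for i, token in enumerate(doc):
        if i == 0:
            token.is_sent_start = True
        else:
            # start a new sentence only directly after a newline token
            token.is_sent_start = doc[i - 1].text == "\n"
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("newline_segmenter", before="parser")
doc = nlp("This is a sentence. This is another.\n\nThis is a\nthird sentence.")
print([sent.text for sent in doc.sents])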
I am using this to add stop words to spaCy's list of stop words:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
However, when I save the nlp object using nlp.to_disk() and load it back again with nlp.from_disk(), I lose the list of custom stop words.
Is there a way to save the custom stopwords with the nlp model?
Thanks in advance
Most language defaults (stop words, lexical attributes, and syntax iterators) are not saved with the model.
If you want to customize them, you can create a custom language class, see: https://spacy.io/usage/linguistic-features#language-subclass. An example copied from this link:
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
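Since the custom stop words live on the class-level Defaults rather than in the serialized data, loading the pipeline back with the same custom class should restore them. A minimal sketch, assuming spaCy v2.x and the classes defined above (the path is just an example):

# Save the pipeline as usual
nlp2.to_disk("./custom_en_model")

# Load it back using the custom class, so CustomEnglishDefaults
# (and the custom stop words) are in effect again
nlp_loaded = CustomEnglish().from_disk("./custom_en_model")
print([token.is_stop for token in nlp_loaded("custom stop")])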
In the documentation on text generation (https://huggingface.co/transformers/main_classes/model.html#generative-models) there is the option to pass
bad_words_ids (List[int], optional) – List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
Is there also the option to put something along the lines of "allowed_words_ids"? The idea would be to restrict the language of the generated texts.
I'd also suggest doing what Sahar Mills said. You can do it in the following way.
First, get the whole vocab of the model you are using, e.g.:
from transformers import AutoTokenizer
# Load tokenizer
checkpoint = "CenIA/distillbert-base-spanish-uncased" #Example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100] # to see the first 100 words
Then define the words you do want in the generated text (these will be removed from the bad-words list):
words_to_delete = ['forzado', 'vendieron', 'verticales'] # or load them from somewhere else
Finally, define a function that creates bad_words_ids, that is, the whole model vocab minus the words you want in the generated text:
def create_bad_words_ids(bad_words, words_to_delete):
    # Remove the allowed words from the full vocabulary
    for word in words_to_delete:
        if word in bad_words:
            bad_words.remove(word)
    # Map every remaining word to its token id; generate() expects a list of id lists
    return [[vocab[word]] for word in bad_words]

bad_words_ids = create_bad_words_ids(bad_words=list(vocab.keys()), words_to_delete=words_to_delete)
print(bad_words_ids)
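For completeness, a hedged usage sketch of passing the result to generate(). This assumes a generative (causal LM) checkpoint whose tokenizer was used to build the ids above; the model name below is only a placeholder, not a real recommendation:

from transformers import AutoModelForCausalLM

# NOTE: placeholder checkpoint; substitute a causal LM whose tokenizer
# matches the one used to build bad_words_ids above
model = AutoModelForCausalLM.from_pretrained("some-spanish-causal-lm")
input_ids = tokenizer("Texto de ejemplo", return_tensors="pt").input_ids
output_ids = model.generate(input_ids, bad_words_ids=bad_words_ids, max_length=30)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))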
Hope it helps,
cheers
I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have Python 3.7.3, spaCy 2.2.3, and Rasa 1.6.1.
Does someone know how to fix this issue?
That sounds like a naming mismatch: I guess you applied a matcher built for one text (and its vocab) to another one, so the match_id hash can no longer be resolved and spaCy gets confused.
To solve it, make sure that you use the same matcher on the same text, like below.
Perform the standard imports, reset nlp, and import the PhraseMatcher library:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one that we built on this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)
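To see where the E018 error in the question comes from, here is a minimal sketch of the failure mode (the hash value will differ from the one in the question): a hash stored in one Vocab's StringStore cannot be looked up through a different Vocab.

import spacy

nlp_a = spacy.blank('en')
nlp_b = spacy.blank('en')

key = nlp_a.vocab.strings.add('VoodooEconomics')  # hash stored only in nlp_a's StringStore
print(nlp_a.vocab.strings[key])                   # works: 'VoodooEconomics'
print(nlp_b.vocab.strings[key])                   # raises KeyError [E018]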
Spacy automatically tokenizes word contractions such as "dont" and "don't" into "do" and "nt"/"n't". For instance, a sentence like "I dont understand" would be tokenized into: ["I", "do", "nt", "understand"].
I understand this is usually helpful in many NLP tasks, but is there a way to suppress this special tokenization rule in Spacy such that the result is ["I", "dont", "understand"] instead?
This is because I am trying to evaluate the performance (f1-score for the BIO tagging scheme) of my custom Spacy NER model, and the mismatch between the number of tokens in the input sentence and the number of predicted token tags is causing problems for my evaluation code down the line:
Input (3 tokens): [("I", "O"), ("dont", "O"), ("understand", "O")]
Predicted (4 tokens): [("I", "O"), ("do", "O"), ("nt", "O"), ("understand", "O")]
Of course, if anyone has any suggestions for a better way to perform evaluation on sequential tagging tasks in Spacy (perhaps like the seqeval package but more compatible with Spacy's token format), that would be greatly appreciated as well.
The special-case tokenization rules are defined in the tokenizer_exceptions.py in the respective language data (see here for the English "nt" contractions). When you create a new Tokenizer, those special case rules can be passed in via the rules argument.
Approach 1: Custom tokenizer with different special case rules
So one thing you could do for your use case is to reconstruct the English Tokenizer with the same prefix, suffix and infix rules, but with only a filtered set of tokenizer exceptions. Tokenizer exceptions are keyed by the string, so you could remove the entries for "dont" and whatever else you need. However, the code is quite verbose, since you're reconstructing the whole tokenizer:
from spacy.lang.en import English
from spacy.lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from spacy.lang.en import TOKENIZER_EXCEPTIONS
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex
prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_re = compile_infix_regex(TOKENIZER_INFIXES).finditer
filtered_exc = {key: value for key, value in TOKENIZER_EXCEPTIONS.items() if key not in ["dont"]}
nlp = English()
tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re,
    suffix_search=suffix_re,
    infix_finditer=infix_re,
    rules=filtered_exc
)
nlp.tokenizer = tokenizer
doc = nlp("I dont understand")
Approach 2: Merge (or split) tokens afterwards
An alternative approach would be to keep the tokenization as it is, but add rules on top that merge certain tokens back together afterwards to match the desired tokenization. This is obviously going to be slower at runtime, but it might be easier to implement and reason about, because you can approach it from the perspective of "Which tokens are currently separated but should be one?". For this, you could use the rule-based Matcher and the retokenizer to merge the matched tokens back together. As of spaCy v2.1, it also supports splitting, in case that's relevant.
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "do"}, {"LOWER": "nt"}]]
matcher.add("TO_MERGE", None, *patterns)
doc = nlp("I dont understand")
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        span = doc[start:end]
        retokenizer.merge(span)
The above pattern would match two tokens (one dict per token), whose lowercase forms are "do" and "nt" (e.g. "DONT", "dont", "DoNt"). You can add more lists of dicts to the patterns to describe other sequences of tokens. For each match, you can then create a Span and merge it into one token. To make this logic more elegant, you could also wrap it as a custom pipeline component, so it's applied automatically when you call nlp on a text.
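As a rough sketch of that last idea (spaCy v2.x API; the component name is just illustrative), the merge logic can be wrapped in a function and added to the pipeline so it runs on every call to nlp:

def merge_contractions(doc):
    # Find the patterns defined above and merge each match into one token
    matches = matcher(doc)
    with doc.retokenize() as retokenizer:
        for match_id, start, end in matches:
            retokenizer.merge(doc[start:end])
    return doc

nlp.add_pipe(merge_contractions, name="merge_contractions", last=True)
doc = nlp("I dont understand")
print([token.text for token in doc])  # expected: ['I', 'dont', 'understand']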
I'm trying to do data augmentation on a FAQ dataset. I replace words, specifically nouns, with their most similar words from WordNet, checking the similarity with spaCy. I use multiple for loops to go through my dataset.
import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd

nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')

questions = pd.read_csv("FAQ.csv")
list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))

for question in list_questions:
    for token in question:
        treshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > treshold:
                        good_word = similar_word
                        treshold = token.similarity(similar_word)
However, the following warning is printed several times and I don't understand why:
UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
It is my similar_word.similarity(token) call that creates the problem, but I don't understand why.
The form of my list_questions is:
list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]
I need to check not only token but also similar_word in the loop; for example, I still get the error here:
tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')

if(similar_word):
    for token in tokens:
        if (token):
            print(token.text, similar_word.similarity(token))
You get that error message when similar_word is not a valid spacy document. E.g. this is a minimal reproducible example:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
#similar_word = nlp(u'rabbit')
similar_word = nlp(u'')
for token in tokens:
    print(token.text, similar_word.similarity(token))
If you change the '' to be 'rabbit' it works fine. (Cats are apparently just a fraction more similar to rabbits than dogs are!)
(UPDATE: As you point out, unknown words also trigger the warning; they will be valid spacy objects, but not have any word vector.)
So, one fix would be to check similar_word is valid, including having a valid word vector, before calling similarity():
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
similar_word = nlp(u'')
if(similar_word and similar_word.vector_norm):
    for token in tokens:
        if(token and token.vector_norm):
            print(token.text, similar_word.similarity(token))
Alternative Approach:
You could suppress the particular warning. It is W008. I believe setting an environmental variable SPACY_WARNING_IGNORE=W008 before running your script would do it. (Not tested.)
(See source code)
By the way, similarity() might cause some CPU load, so it is worth storing the result in a variable instead of calculating it three times as you currently do. (Some people might argue that is premature optimization, but I think it might also make the code more readable.)
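As a sketch of that suggestion, using the variable names from the question's inner loop (this fragment only makes sense in that context), the similarity can be computed once and reused:

similar_word = nlp(lemma.name())
if similar_word and similar_word.vector_norm:
    sim = similar_word.similarity(token)  # computed once, reused below
    if sim != 1. and sim > treshold:
        good_word = similar_word
        treshold = sim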
I have suppressed the W008 warning by setting the environment variable with this code in the run file.
import os
from flask import Flask

app = Flask(__name__)
app.config['SPACY_WARNING_IGNORE'] = "W008"
os.environ["SPACY_WARNING_IGNORE"] = "W008"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)