Customize spacy stop words and save the model - python-3.x

I am using this to add stop words to spaCy's list of stop words:
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}
However, when I save the nlp object using nlp.to_disk() and load it back again with nlp.from_disk(),
I lose the list of custom stop words.
Is there a way to save the custom stopwords with the nlp model?
Thanks in advance

Most language defaults (stop words, lexical attributes, and syntax iterators) are not saved with the model.
If you want to customize them, you can create a custom language class, see: https://spacy.io/usage/linguistic-features#language-subclass. An example copied from this link:
from spacy.lang.en import English
class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults
nlp1 = English()
nlp2 = CustomEnglish()
print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
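If a language subclass is more than you need, another option is to store the custom words in a small file next to the saved model and re-apply them after loading. The following is only a sketch and not part of the answer above: the helper names and the JSON file name are hypothetical, and it relies on setting the is_stop flag on the vocab entries directly, so it only matches the exact strings listed (no case folding).

```python
import json
import tempfile
from pathlib import Path

import spacy


def save_stop_words(words, path):
    """Write the custom stop words to a JSON file (hypothetical helper)."""
    Path(path).write_text(json.dumps(sorted(words)), encoding="utf8")


def apply_stop_words(nlp, path):
    """Read the JSON file and flag each listed string as a stop word in nlp.vocab."""
    words = json.loads(Path(path).read_text(encoding="utf8"))
    for word in words:
        nlp.vocab[word].is_stop = True
    return nlp


custom_words = {"foobar", "bazqux"}  # made-up words for illustration

with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = Path(tmp_dir) / "custom_stop_words.json"  # hypothetical file name
    save_stop_words(custom_words, words_path)

    # A blank pipeline stands in for the model you would load from disk.
    nlp = spacy.blank("en")
    before = [token.is_stop for token in nlp("foobar bazqux")]
    apply_stop_words(nlp, words_path)
    after = [token.is_stop for token in nlp("foobar bazqux")]

print(before, after)
```

You would call apply_stop_words() once, right after loading the saved pipeline.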

Related

How to replace spacy SentenceSegmenter with custom SentenceSegmenter

I am learning NLP and I was trying to replace spaCy's default SentenceSegmenter with my custom SentenceSegmenter. While doing so, I see that my custom code is not replacing spaCy's default.
Note : Spacy == 3.4.1
Below is my code:
import spacy
from spacy.language import Language
nlp = spacy.load("en_core_web_sm")
@Language.component("component")
def changeSentenceSegmenter(doc):
    for token in doc:
        if token.text=="\n":
            doc[token.i+1].is_sent_start = True
    return doc
nlp.add_pipe('component', before='parser')
nlp.pipe_names
mystring = nlp(u"This is a sentence. This is another.\n\nThis is a\nthird sentence.")
for sent in mystring.sents:
    print(sent)
The output for above code is :
However, my desired output is :
By default, is_sent_start is None. Your component sets it to True for some tokens but leaves it unset for the others. When the parser runs, it assigns a value to every token that is still unset, and it may create new sentences that way. In this example it looks like that's what's happening.
If you want your component to be the only thing that sets sentence boundaries, set is_sent_start to True or False for every token.
Also note there is one open bug related to this behaviour, so it's possible for the parser to overwrite settings when it shouldn't, though it usually doesn't come up. In particular, if you set a value for every token, or just set True for some tokens, it shouldn't come up.
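A minimal sketch of a component that sets a value for every token could look like the following. It assumes spaCy v3 and uses a blank English pipeline so that nothing else touches the boundaries; the component name "newline_segmenter" is made up for this example. With a trained pipeline you would add it with before='parser', as in the question.

```python
import spacy
from spacy.language import Language


@Language.component("newline_segmenter")
def newline_segmenter(doc):
    # The first token always starts a sentence; after that, a token starts
    # a sentence only if the previous token contained a newline.
    starts_sentence = True
    for token in doc:
        token.is_sent_start = starts_sentence  # explicit True/False for every token
        starts_sentence = "\n" in token.text
    return doc


nlp = spacy.blank("en")  # no parser here, so only this component sets boundaries
nlp.add_pipe("newline_segmenter")

doc = nlp("This is a sentence. This is another.\n\nThis is a\nthird sentence.")
sentences = [sent.text.strip() for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

Because every token gets an explicit True or False, no later component has unset values left to fill in.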

How to generate sentence embedding using long-former model

I am using Hugging Face mrm8488/longformer-base-4096-finetuned-squadv2 pre-trained model
https://huggingface.co/mrm8488/longformer-base-4096-finetuned-squadv2.
I want to generate sentence level embedding. I have a data-frame which has a text column.
I am using this code:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pas text-column here from my data-frame
#question = "What has Huggingface done ?"
encoding = tokenizer(question, text, return_tensors="pt")
# I don't want to use it for Question-Answer use-case. I just need the sentence embeddings
input_ids = encoding["input_ids"]
# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]
How can I modify the above code to generate sentence embeddings?
I have the following examples:
Text
i've added notes to the claim and it's been escalated for final review
after submitting the request you'll receive an email confirming the open request.
hello my name is person and i'll be assisting you
this is sam and i'll be assisting you for date.
I'll return the amount as asap.
ill return it to you.
The Longformer uses a local attention mechanism and you need to pass a global attention mask to let one token attend to all tokens of your sequence.
import torch
from transformers import LongformerTokenizer, LongformerModel
ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"
tokenizer = LongformerTokenizer.from_pretrained(ckpt)
model = LongformerModel.from_pretrained(ckpt)
text = "Huggingface has democratized NLP. Huge thanks to Huggingface for this." # I will pass text-column here from my data-frame
encoding = tokenizer(text, return_tensors="pt")
# local attention is the default everywhere; give only the first token global attention
global_attention_mask = torch.zeros_like(encoding["input_ids"])
global_attention_mask[:, 0] = 1
encoding["global_attention_mask"] = global_attention_mask
o = model(**encoding)
# use the hidden state of the first token as the sentence embedding
sentence_embedding = o.last_hidden_state[:, 0]
You should keep in mind that mrm8488/longformer-base-4096-finetuned-squadv2 was not pre-trained to produce meaningful sentence embeddings and faces the same issues as MLM pre-trained BERT models regarding sentence embeddings.
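Since the texts come from a data-frame column, you will probably encode several of them at once. The sketch below only illustrates the tensor shapes for a padded batch and does not load the model: the helper name is hypothetical, and a random tensor stands in for last_hidden_state.

```python
import torch


def first_token_global_mask(attention_mask):
    """Return a mask with global attention (1) on the first token of each row only."""
    global_mask = torch.zeros_like(attention_mask)
    global_mask[:, 0] = 1
    return global_mask


# Two sequences padded to length 5; the second one has two padding positions.
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
global_attention_mask = first_token_global_mask(attention_mask)

# Stand-in for o.last_hidden_state with shape (batch, seq_len, hidden_size).
hidden_states = torch.randn(2, 5, 8)
sentence_embeddings = hidden_states[:, 0]  # one vector per input text

print(global_attention_mask)
print(sentence_embeddings.shape)
```

The first position is never padding, so taking index 0 is safe for every row of the batch.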

Whitelist tokens for text generation (XLNet, GPT-2) in huggingface-transformers

In the documentation on text generation (https://huggingface.co/transformers/main_classes/model.html#generative-models) there is the option to put
bad_words_ids (List[int], optional) – List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
Is there also the option to put something along the lines of "allowed_words_ids"? The idea would be to restrict the language of the generated texts.
I'd also suggest doing what Sahar Mills said. You can do it in the following way.
You get the whole vocab of the model you are using, e.g.
from transformers import AutoTokenizer
# Load tokenizer
checkpoint = "CenIA/distillbert-base-spanish-uncased" #Example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100] # to see the first 100 words
Define words you do want in the model.
words_to_delete = ['forzado', 'vendieron', 'verticales'] # or load them from somewhere else
Define function to create the bad_words_ids, that is, the whole model vocab minus the words you want in the model
def create_bad_words_ids(bad_words_ids, words_to_delete):
    # drop every word we want to keep from the list of banned words
    for word in words_to_delete:
        if word in bad_words_ids:
            bad_words_ids.remove(word)
    return bad_words_ids
# start from the whole vocabulary (token strings), then remove the allowed words
bad_words_ids = create_bad_words_ids(bad_words_ids=list(vocab.keys()), words_to_delete=words_to_delete)
print(bad_words_ids)
Hope it helps,
cheers
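Note that the list built above holds token strings, while recent versions of transformers document bad_words_ids as a list of token-id lists. A rough sketch of that conversion, using a toy vocabulary instead of a real tokenizer (the function name, the toy ids and the always_allowed parameter are assumptions made for illustration):

```python
def build_bad_words_ids(vocab, allowed_tokens, always_allowed=()):
    """Ban every token id in vocab except the allowed tokens.

    vocab maps token string -> token id, like tokenizer.get_vocab().
    Returns a list of single-id lists.
    """
    keep = set(allowed_tokens) | set(always_allowed)
    return [
        [token_id]
        for token, token_id in sorted(vocab.items(), key=lambda item: item[1])
        if token not in keep
    ]


# Toy vocabulary standing in for tokenizer.get_vocab(); the ids are made up.
toy_vocab = {
    "[PAD]": 0, "[CLS]": 1, "[SEP]": 2,
    "forzado": 3, "vendieron": 4, "verticales": 5, "casa": 6,
}
allowed = ["forzado", "vendieron", "verticales"]

bad_words_ids = build_bad_words_ids(
    toy_vocab, allowed, always_allowed=["[PAD]", "[CLS]", "[SEP]"]
)
print(bad_words_ids)  # [[6]]
```

Keeping the special tokens allowed is a design choice here, so that generation can still end and pad normally.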

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have python 3.7.3, spaCy version is 2.2.3, rasa version 1.6.1
Does anyone know how to fix this issue?
That sounds like a naming mix-up. I guess you applied a matcher built for one text to another one, so the match_id is different and spaCy gets confused.
To solve it, make sure that you use the same matcher on the same text, like below:
Perform the standard imports, reload nlp, and import the PhraseMatcher class:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches: # the matcher has to be the same one that we built for this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)
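To see why the lookup can fail, it may help to reproduce the error in isolation. This sketch assumes the cause is a hash being resolved against a different StringStore than the one that created it; it uses two blank English pipelines, which each get their own vocab.

```python
import spacy

nlp_a = spacy.blank("en")
nlp_b = spacy.blank("en")  # separate vocab and StringStore

# Adding a string returns its hash; the same store can map it back.
match_id = nlp_a.vocab.strings.add("VoodooEconomics")
resolved = nlp_a.vocab.strings[match_id]

# The other store has never seen this string, so the hash cannot be resolved.
try:
    nlp_b.vocab.strings[match_id]
    lookup_failed = False
except KeyError as error:
    lookup_failed = True
    print(error)

print(resolved, lookup_failed)
```

If this matches your situation, look for places where the matcher, the Doc and the vocab used for the lookup come from different nlp objects.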

Is there a way to turn off specific built-in tokenization rules in Spacy?

Spacy automatically tokenizes word contractions such as "dont" and "don't" into "do" and "nt"/"n't". For instance, a sentence like "I dont understand" would be tokenized into: ["I", "do", "nt", "understand"].
I understand this is usually helpful in many NLP tasks, but is there a way to suppress this special tokenization rule in Spacy such that the result is ["I", "dont", "understand"] instead?
This is because I am trying to evaluate the performance (f1-score for BIO tagging scheme) of my custom Spacy NER model, and the mismatch in the number of tokens in the input sentence and the number of predicated token tags is causing problems for my evaluation code down the line:
Input (3 tokens): [("I", "O"), ("dont", "O"), ("understand", "O")]
Predicted (4 tokens): [("I", "O"), ("do", "O"), ("nt", "O"), ("understand", "O")]
Of course, if anyone has any suggestions for a better way to perform evaluation on sequential tagging tasks in Spacy (perhaps like the seqeval package but more compatible with Spacy's token format), that would be greatly appreciated as well.
The special-case tokenization rules are defined in the tokenizer_exceptions.py in the respective language data (see here for the English "nt" contractions). When you create a new Tokenizer, those special case rules can be passed in via the rules argument.
Approach 1: Custom tokenizer with different special case rules
So one thing you could do for your use case is to reconstruct the English Tokenizer with the same prefix, suffix and infix rules, but with only a filtered set of tokenizer exceptions. Tokenizer exceptions are keyed by the string, so you could remove the entries for "dont" and whatever else you need. However, the code is quite verbose, since you're reconstructing the whole tokenizer:
from spacy.lang.en import English
from spacy.lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES
from spacy.lang.en import TOKENIZER_EXCEPTIONS
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex
prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES).search
suffix_re = compile_suffix_regex(TOKENIZER_SUFFIXES).search
infix_re = compile_infix_regex(TOKENIZER_INFIXES).finditer
filtered_exc = {key: value for key, value in TOKENIZER_EXCEPTIONS.items() if key not in ["dont"]}
nlp = English()
tokenizer = Tokenizer(
    nlp.vocab,
    prefix_search=prefix_re,
    suffix_search=suffix_re,
    infix_finditer=infix_re,
    rules=filtered_exc
)
nlp.tokenizer = tokenizer
doc = nlp("I dont understand")
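A shorter variant of the same idea is sketched below. It assumes a spaCy version in which the tokenizer's rules property can be read and reassigned (this is documented for v3), so the existing tokenizer is kept and only the unwanted special cases are filtered out.

```python
import spacy

nlp = spacy.blank("en")
tokens_before = [token.text for token in nlp("I dont understand")]

# Keep every special-case rule except the ones for "dont" (any casing).
unwanted = {"dont"}
nlp.tokenizer.rules = {
    key: value
    for key, value in nlp.tokenizer.rules.items()
    if key.lower() not in unwanted
}

tokens_after = [token.text for token in nlp("I dont understand")]
print(tokens_before, tokens_after)
```

Add more strings to unwanted to switch off other contractions in the same way.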
Approach 2: Merge (or split) tokens afterwards
An alternative approach would be to keep the tokenization as it is, but add rules on top that merge certain tokens back together afterwards to match the desired tokenization. This is obviously going to be slower at runtime, but it might be easier to implement and reason about, because you can approach it from the perspective of "Which tokens are currently separated but should be one?". For this, you could use the rule-based Matcher and the retokenizer to merge the matched tokens back together. As of spaCy v2.1, it also supports splitting, in case that's relevant.
from spacy.lang.en import English
from spacy.matcher import Matcher
nlp = English()
matcher = Matcher(nlp.vocab)
patterns = [[{"LOWER": "do"}, {"LOWER": "nt"}]]
matcher.add("TO_MERGE", None, *patterns)
doc = nlp("I dont understand")
matches = matcher(doc)
with doc.retokenize() as retokenizer:
    for match_id, start, end in matches:
        span = doc[start:end]
        retokenizer.merge(span)
The above pattern would match two tokens (one dict per token), whose lowercase forms are "do" and "nt" (e.g. "DONT", "dont", "DoNt"). You can add more lists of dicts to the patterns to describe other sequences of tokens. For each match, you can then create a Span and merge it into one token. To make this logic more elegant, you could also wrap it as a custom pipeline component, so it's applied automatically when you call nlp on a text.
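Wrapped as a pipeline component, the merge logic could look roughly like this. The sketch assumes spaCy v3, where components are registered with Language.factory and Matcher.add takes a list of patterns; the class name and the factory name "contraction_merger" are made up for the example.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.util import filter_spans


class ContractionMerger:
    """Merge token sequences such as "do" + "nt" back into one token."""

    def __init__(self, vocab):
        self.matcher = Matcher(vocab)
        # One inner list per token sequence that should become a single token.
        self.matcher.add("TO_MERGE", [[{"LOWER": "do"}, {"LOWER": "nt"}]])

    def __call__(self, doc):
        spans = [doc[start:end] for _, start, end in self.matcher(doc)]
        with doc.retokenize() as retokenizer:
            # filter_spans drops overlapping matches, which cannot be merged together.
            for span in filter_spans(spans):
                retokenizer.merge(span)
        return doc


@Language.factory("contraction_merger")
def create_contraction_merger(nlp, name):
    return ContractionMerger(nlp.vocab)


nlp = spacy.blank("en")
nlp.add_pipe("contraction_merger")
doc = nlp("I dont understand")
tokens = [token.text for token in doc]
print(tokens)
```

In a trained pipeline you would place this component before the NER component, so the entity recognizer sees the merged tokens.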

Resources