Sentence split using spaCy sentencizer - python-3.x

I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English

nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)

print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly, but I am not sure why it doesn't start the second sentence at the ( and instead treats the ( as the end of the first sentence.

I have tested the code below with the en_core_web_lg and en_core_web_sm models; the sm model performs similarly to the sentencizer (the lg model takes a performance hit).
The custom boundaries below only work with the sm model; the lg model splits differently.
import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            doc[token.i + 1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i + 1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)

The sentencizer is a very fast but also very minimal sentence splitter that's not going to have good performance with punctuation like this. It's good for splitting texts into sentence-ish chunks, but if you need higher quality sentence segmentation, use the parser component of an English model to do sentence segmentation.
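As a rough sketch of that suggestion (assuming en_core_web_sm is installed), letting the parser set the sentence boundaries looks like this; the exact split can still vary between model versions:
import spacy

# A full English model; its dependency parser sets sentence boundaries.
nlp = spacy.load('en_core_web_sm')

text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)

# doc.sents is now derived from the parse rather than from punctuation rules.
print([sent.text for sent in doc.sents])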

How to know which contextual embedding to use at test time

Models like BERT generate contextual embeddings for words whose meaning depends on context, like 'bank' or 'left'.
I don't understand which contextual embedding the model chooses to use at test time. Given a test sentence for classification, when we load the pre-trained BERT, how do we initialize the word (token) embedding so that the right contextual embedding is used rather than the other embeddings of the same word?
More specifically, there is a convert_to_id() function that converts a word to an ID. How does one ID represent the correct contextual embedding for the input sentence at test time? Thank you.
I searched all over online but only found explanations of the difference between static and contextual embeddings. The high-level concept is easy to get, but how it is actually achieved is unclear. I also looked for code examples, but convert_to_id() confuses me further, as I said in my question.
TL;DR There's only one embedding for the word "left." There's also no way to know which meaning the word has if the sequence is only one word. BERT uses the representation of its start-of-sequence token (i.e., [CLS]) to represent each sequence, and that representation will differ depending on the context in which the word "left" is used.
Given your example of text classification, the input sentence is first tokenized using the WordPiece tokenizer, and the [CLS] token's representation is fed to a feedforward layer for classification.
You can't really debate context when given single words, so I'll use two different sentences:
I left my house this morning.
You'll see my house on your left.
The steps to performing text classification typically are:
Tokenize your input text and receive the necessary input_ids and attention_masks.
Feed this tokenized input into your model and receive the outputs.
Feed the [CLS] token's representation to a classifier layer (typically a feedforward network).
The two sentences are tokenized to (using bert-base-uncased):
['[CLS]', 'i', 'left', 'my', 'house', 'this', 'morning', '.', '[SEP]']
['[CLS]', 'you', "'", 'll', 'see', 'my', 'house', 'on', 'your', 'left', '.', '[SEP]']
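As an aside on the convert_to_id()-style step from the question: with the transformers tokenizer, converting a token to its ID is a plain vocabulary lookup and carries no context at all; context only enters when the model processes the whole sequence. A small sketch, assuming bert-base-uncased:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# WordPiece tokenization, then a plain vocabulary lookup from token to ID.
tokens = tokenizer.tokenize("You'll see my house on your left.")
ids = tokenizer.convert_tokens_to_ids(tokens)

# 'left' always maps to the same ID; its contextual embedding is only
# computed later, when the model runs over the full sequence.
print(list(zip(tokens, ids)))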
The [CLS] token's representation for each sentence will be different because the sentences have different words (i.e., contexts). The result is therefore different.
>>> from transformers import AutoModel, AutoTokenizer
>>> bert = AutoModel.from_pretrained('bert-base-uncased')
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> sentence1 = "I left my house this morning."
>>> sentence2 = "You'll see my house on your left."
>>> sentence1_inputs = tokenizer(sentence1, return_tensors='pt')
>>> sentence2_inputs = tokenizer(sentence2, return_tensors='pt')
>>> sentence1_outputs = bert(**sentence1_inputs)
>>> sentence2_outputs = bert(**sentence2_inputs)
>>> cls1 = sentence1_outputs[1]
>>> cls2 = sentence2_outputs[1]
>>> print(cls1[0][:5])
tensor([-0.8832, -0.3484, -0.8044, 0.6536, 0.6409], grad_fn=<SliceBackward0>)
>>> print(cls2[0][:5])
tensor([-0.8791, -0.4069, -0.8994, 0.7371, 0.7010], grad_fn=<SliceBackward0>)
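To round out step 3 above, here is a minimal sketch of feeding the pooled [CLS] representation into a classifier layer. The Linear head and the two-label setup are placeholder assumptions, not part of the original answer, and the head is untrained:
import torch

# Hypothetical, untrained classifier head: 768 is BERT-base's hidden size,
# 2 is an assumed number of labels.
classifier = torch.nn.Linear(768, 2)

# cls1 and cls2 are the pooled [CLS] outputs from the snippet above.
logits1 = classifier(cls1)
logits2 = classifier(cls2)

# The logits differ because the [CLS] representations differ with context.
print(logits1, logits2)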

finding the POS of the root of a noun_chunk with spacy

When using spaCy you can easily loop over the noun chunks of a text as follows:
import spacy

S = 'This is an example sentence that should include several parts and also make clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentence', 'parts', 'Processing']
How can I get the POS of each of those words (even though it looks like the root of a noun chunk is always a noun), and how can I get the lemma, the shape, and the singular form of that particular word?
Is that even possible?
Thanks.
Each chunk.root is a Token on which you can access different attributes, including lemma_ and pos_ (or tag_ if you prefer the Penn Treebank POS tags).
import spacy

S = 'This is an example sentence that should include several parts and also make ' \
    'clear that studying Natural language Processing is not difficult'
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
for chunk in doc.noun_chunks:
    print('%-12s %-6s %s' % (chunk.root.text, chunk.root.pos_, chunk.root.lemma_))
sentence     NOUN   sentence
parts        NOUN   part
Processing   NOUN   processing
BTW... In this sentence "processing" is a noun, so its lemma is "processing", not "process", which is the lemma of the verb "processing".
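Since the question also asked about the shape and the singular form: those are further attributes on the same Token. A short sketch, reusing the doc from above (for regular plural nouns like "parts", lemma_ is usually the singular form; shape_ only reflects the orthography):
for chunk in doc.noun_chunks:
    root = chunk.root
    # e.g. for "parts": lemma_ -> "part" (the singular); shape_ -> an "xxxx"-style pattern
    print(root.text, root.pos_, root.lemma_, root.shape_)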

Unexpected lemmatize result from gensim

I used the following code to lemmatize texts from which stop words had already been removed and in which only words longer than 3 characters were kept. However, it split existing words: 'wheres' became ['where', 's'] and 'youre' became ['-PRON-', 'be']. I didn't expect 's', '-PRON-', or 'be' in my text. What caused this behaviour, and what can I do about it?
import spacy

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # to keep only tokens with the given POS tags, add
        # 'if token.pos_ in allowed_postags' to the comprehension below
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load('en', disable=['parser', 'ner'])

data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

Lemmatization of words using spacy and nltk not giving correct lemma

I want to get the lemmatized forms of the words in the list given below:
(eg)
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
I got the lemmas like:
Funnier
Funniest
mighty
tight
When I use NLTK's WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I got:
Funnier : Funnier
Funniest : Funniest
mightiest : mightiest
tighter : tighter
Can anyone help with this?
Thanks.
Lemmatisation depends entirely on the part-of-speech tag you supply when getting the lemma of a particular word.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: Split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The above code is a simple example of how to use the wordnet lemmatizer on words and sentences.
Notice it didn't do a good job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as expected. This can be corrected if we provide the correct part-of-speech (POS) tag as the second argument to lemmatize().
Sometimes the same word can have multiple lemmas, depending on the meaning / context.
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
For the example from the question, specify the corresponding POS tag:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
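Instead of hard-coding a tag like wordnet.ADJ_SAT, a common pattern is to derive the WordNet POS tag from NLTK's own POS tagger. The helper below is illustrative, not part of the original answer, and assumes the punkt and averaged_perceptron_tagger resources are downloaded:
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag onto a WordNet POS constant (noun by default)."""
    mapping = {'J': wordnet.ADJ, 'N': wordnet.NOUN,
               'V': wordnet.VERB, 'R': wordnet.ADV}
    return mapping.get(treebank_tag[0], wordnet.NOUN)

sentence = "The striped bats are hanging on their feet for best"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # [(word, treebank_tag), ...]
print(' '.join(lemmatizer.lemmatize(word, to_wordnet_pos(tag))
               for word, tag in tagged))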

How to detokenize spacy text without doc context?

I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenizer, on both the encoder and the decoder side.
The output of the seq2seq model is a stream of tokens. I want to detokenize them to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spaCy to reverse the tokenization done by the rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
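For the case where you still have the Doc, a minimal sketch of using those trailing-whitespace flags to rebuild the original string (this relies only on spaCy's standard whitespace_ / text_with_ws attributes, not on the seq2seq setting from the question):
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This doesn't work. (But this does.)")

# Each token stores its trailing whitespace, so joining text_with_ws
# reproduces the original string exactly.
rebuilt = ''.join(token.text_with_ws for token in doc)
assert rebuilt == doc.text
print(rebuilt)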
TL;DR
I've written code that attempts to do it; the snippet is below.
It's another approach, a function I just wrote, with a computational complexity of O(n^2).
The main thought was: "What spaCy splits shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string


class detokenizer:
    """ This class is an attempt to detokenize spaCy tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of lengths of spaCy tokenized words is equal to
        the length of the joined and then spaCy tokenized words...

        In other words, we say we should join only if the join is reversible.
        e.g.:
            for the text ["The", "man", "."]
            we would join "man" with "."
            but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
With this approach you may easily merge "do" and "nt", as well as strip the space between the dot "." and the preceding word.
This method is not perfect, as there are multiple possible combinations of sentences that lead to a specific spaCy tokenization.
I am not sure there is a method to fully detokenize a sentence when all you have is spaCy-separated text, but this is the best I've got.
After searching Google for hours, only a few answers came up, with this very Stack question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers chose, I had to develop my own solution. Hope it helps someone ;)
