Unexpected lemmatize result from gensim - nlp

I used the following code to lemmatize texts that already had stop words removed and kept only words longer than 3 characters. However, after running it, existing words were split: 'wheres' became ['where', 's'] and 'youre' became ['-PRON-', 'be']. I didn't expect 's', '-PRON-', or 'be' to show up in my text. What caused this behaviour, and what can I do about it?
import spacy

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        # Though rare, to keep only tokens with the given POS tags, add 'if token.pos_ in allowed_postags'
        texts_out.append([token.lemma_ for token in doc])
    return texts_out

# Initialize spacy 'en' model, keeping only the tagger component (for efficiency)
nlp = spacy.load('en', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
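For context: spaCy's English tokenizer exceptions split contraction-like strings such as 'wheres' and 'youre' into two tokens, and spaCy 2.x lemmatizes personal pronouns to the placeholder '-PRON-', which is where the extra pieces come from. Below is a minimal sketch (not a confirmed fix) of the POS-filtered variant the comment above alludes to, which also drops the '-PRON-' placeholder; it assumes the same spaCy 2.x 'en' model and that data_words_trigrams is a list of token lists as in the question:

import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])

def lemmatization_filtered(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize, keeping only tokens whose POS tag is in allowed_postags
    and skipping the spaCy 2.x pronoun placeholder '-PRON-'."""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc
                          if token.pos_ in allowed_postags and token.lemma_ != '-PRON-'])
    return texts_out

# data_words_trigrams: list of token lists, as in the question
# data_lemmatized = lemmatization_filtered(data_words_trigrams)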

Related

Spacy: nlp over tokens to annotate IOB file

I have a file that is annotated in IOB format. I appended each token from column one to a list of sentences, so that each sentence is one list of tokens. I then iterate over the tokens of each sentence while iterating over the list of sentences. The code:
import csv
import spacy

with open('/content/drive/MyDrive/Spacy/Test/annotated_tuebadz_spacy.tsv', 'w+', encoding='utf-8') as tsvfile:
    wrt = csv.writer(tsvfile, delimiter='\t')
    nlp = spacy.load("/content/drive/MyDrive/Spacy/model/model-best")
    for sent in sent_list:
        for token in sent:
            doc = nlp(token)
            if doc[0].ent_iob_ == "O":
                label = doc[0].ent_iob_ + doc[0].ent_type_
            else:
                label = doc[0].ent_iob_ + "-" + doc[0].ent_type_
            print(doc.text, label)
            wrt.writerow((doc.text, label))
where sent_list is the list of tokenized sentences, each sentence being a list of tokens, e.g. [["I", "am", "a", "robot", "."], ["How", "are", "you", "?"]]. I want to manually compare against the gold annotations in the original script, so I want to stick to this tokenization style.
Now, my question: the results are much lower than the scores from spaCy's evaluate script (which does not annotate the file itself). Is the problem that it does not use context information? How can I improve the script? Any suggestions, including other strategies, would be appreciated!
Thanks!
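One thing worth noting (a sketch from my side, not a confirmed fix): because each token is passed to nlp on its own, the NER component never sees the surrounding sentence, which alone can explain lower scores. In spaCy 3.x you can keep the gold tokenization and still give the model full-sentence context by constructing a Doc from the token list and running the trained pipeline on it, roughly like this:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("/content/drive/MyDrive/Spacy/model/model-best")

for sent in sent_list:  # sent_list as defined in the question
    # Build a Doc with the gold tokens, then run the pipeline on the whole sentence.
    doc = nlp(Doc(nlp.vocab, words=sent))
    for token in doc:
        if token.ent_iob_ == "O":
            label = token.ent_iob_ + token.ent_type_
        else:
            label = token.ent_iob_ + "-" + token.ent_type_
        print(token.text, label)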

Sentence split using spacy sentencizer

I am using spaCy's sentencizer to split the sentences.
from spacy.lang.en import English

nlp = English()
sbd = nlp.create_pipe('sentencizer')
nlp.add_pipe(sbd)
text = "Please read the analysis. (You'll be amazed.)"
doc = nlp(text)
sents_list = []
for sent in doc.sents:
    sents_list.append(sent.text)
print(sents_list)
print([token.text for token in doc])
OUTPUT
['Please read the analysis. (',
"You'll be amazed.)"]
['Please', 'read', 'the', 'analysis', '.', '(', 'You', "'ll", 'be',
'amazed', '.', ')']
Tokenization is done correctly, but I am not sure why the second sentence is not split off together with the ( and why the ( is instead treated as the end of the first sentence.
I have tested the code below with the en_core_web_lg and en_core_web_sm models; the results for the sm model are similar to using the sentencizer (the lg model takes a performance hit).
The custom boundaries below only work with the sm model and produce different splits with the lg model.
import spacy

nlp = spacy.load('en_core_web_sm')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ".(" or token.text == ").":
            doc[token.i+1].is_sent_start = True
        elif token.text == "Rs." or token.text == ")":
            doc[token.i+1].is_sent_start = False
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
The sentencizer is a very fast but also very minimal sentence splitter that's not going to have good performance with punctuation like this. It's good for splitting texts into sentence-ish chunks, but if you need higher quality sentence segmentation, use the parser component of an English model to do sentence segmentation.
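For comparison, a minimal sketch of the parser-based approach on the same text (assuming en_core_web_sm is installed); the dependency parse usually handles punctuation like the parenthesized sentence better than the rule-based sentencizer:

import spacy

# Full pipeline: the dependency parser sets the sentence boundaries.
nlp = spacy.load('en_core_web_sm')
doc = nlp("Please read the analysis. (You'll be amazed.)")
print([sent.text for sent in doc.sents])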

How to get token ids using spaCy (I want to map a text sentence to a sequence of integers)

I want to use spaCy to tokenize sentences into sequences of integer token ids that I can use for downstream tasks. I expect to use it something like the snippet below; please fill in ???
import spacy
# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load('en_core_web_lg')
# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at ")
doc = nlp(text)
idxs = ??????
print(idxs)
# Want output to be something like;
>> array([ 8045, 70727, 24304, 96127, 44091, 37596, 24524, 35224, 36253])
Preferably the integers refer to some special embedding id in en_core_web_lg.
spacy.io/usage/vectors-similarity does not give a hint about which attribute in doc to look for.
I asked this on Cross Validated, but it was deemed off-topic. Proper terms for googling/describing this problem would also be helpful.
spaCy uses hashing on texts to get unique ids. All Token objects have multiple forms for different use cases of a given Token in a Document.
If you just want the normalised form of the Tokens, then use the .norm attribute, which is an integer representation of the text (a hash).
>>> import spacy
>>> nlp = spacy.load('en')
>>> text = "here is some test text"
>>> doc = nlp(text)
>>> [token.norm for token in doc]
[411390626470654571, 3411606890003347522, 7000492816108906599, 1618900948208871284, 15099781594404091470]
You can also use other attributes such as the lowercase integer attribute .lower or many other things. Use help() on the Document or Token to get more information.
>>> help(doc[0])
Help on Token object:
class Token(builtins.object)
| An individual token – i.e. a word, punctuation symbol, whitespace,
| etc.
|
...
Solution:
import spacy

nlp = spacy.load('en_core_web_md')
text = (u"When Sebastian Thrun started working on self-driving cars at ")
doc = nlp(text)

ids = []
for token in doc:
    if token.has_vector:
        id = nlp.vocab.vectors.key2row[token.norm]
    else:
        id = None
    ids.append(id)

print([token for token in doc])
print(ids)
#>> [When, Sebastian, Thrun, started, working, on, self, -, driving, cars, at]
#>> [71, 19994, None, 369, 422, 19, 587, 32, 1169, 1153, 41]
Breaking this down:
# A Vocab whose __getitem__ takes a piece of text and returns a Lexeme (whose .norm is a hash)
nlp.vocab
# >> <spacy.vocab.Vocab at 0x12bcdce48>
nlp.vocab['hello'].norm  # hash
# >> 5983625672228268878

# The tensor holding the word vectors
nlp.vocab.vectors.data.shape
# >> (20000, 300)

# A dict mapping hash -> row in this array
nlp.vocab.vectors.key2row
# >> {12646065887601541794: 0,
# >>  2593208677638477497: 1,
# >>  ...}

# So to get the int id of 'earth':
i = nlp.vocab.vectors.key2row[nlp.vocab['earth'].norm]
nlp.vocab.vectors.data[i]

# Note that tokens have hashes but may not have a vector
# (hence no entry in .key2row)
nlp.vocab['Thrun'].has_vector
# >> False
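As a follow-up sketch (my own addition, not from the original answer): if the row ids are meant to feed a downstream embedding layer, one option is to copy spaCy's vector table into an embedding matrix and reserve one extra row for tokens without a vector, instead of carrying None around:

import numpy as np
import spacy

nlp = spacy.load('en_core_web_md')
vectors = nlp.vocab.vectors

# One extra row (all zeros) for tokens that have no vector.
oov_row = vectors.data.shape[0]
embedding_matrix = np.vstack([vectors.data, np.zeros((1, vectors.data.shape[1]))])

def token_ids(text):
    """Map a text to row indices into embedding_matrix."""
    return [vectors.key2row[tok.norm] if tok.has_vector else oov_row
            for tok in nlp(text)]

print(token_ids("When Sebastian Thrun started working on self-driving cars"))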

How to detokenize spacy text without doc context?

I have a sequence-to-sequence model trained on tokens produced by spaCy's tokenization, on both the encoder and decoder sides.
The output is a stream of tokens from a seq2seq model. I want to detokenize the text to form natural text.
Example:
Input to Seq2Seq: Some text
Output from Seq2Seq: This does n't work .
Is there any API in spacy to reverse tokenization done by rules in its tokenizer?
Internally spaCy keeps track of a boolean array to tell whether the tokens have trailing whitespace. You need this array to put the string back together. If you're using a seq2seq model, you could predict the spaces separately.
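As a small sketch of what that looks like in practice (not part of the original answer): each token exposes its trailing whitespace as whitespace_, so a Doc can be reassembled exactly from the token texts plus those flags:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This doesn't work.")

tokens = [t.text for t in doc]
trailing_space = [bool(t.whitespace_) for t in doc]  # the boolean array mentioned above

# Rebuild the original string from the tokens and whitespace flags.
rebuilt = "".join(tok + (" " if ws else "") for tok, ws in zip(tokens, trailing_space))
assert rebuilt == doc.text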
James Bradbury (author of TorchText) was complaining to me about exactly this. He's right that I didn't think about seq2seq models when I designed the tokenization system in spaCy. He developed revtok to solve his problem.
Basically what revtok does (if I understand correctly) is pack two extra bits onto the lexeme IDs: whether the lexeme has an affinity for a preceding space, and whether it has an affinity for a following space. Spaces are inserted between tokens whose lexemes both have space affinity.
Here's the code to find these bits for a spaCy Doc:
def has_pre_space(token):
    if token.i == 0:
        return False
    if token.nbor(-1).whitespace_:
        return True
    else:
        return False

def has_space(token):
    return token.whitespace_
The trick is that you drop a space when either the current lexeme says "no trailing space" or the next lexeme says "no leading space". This means you can decide which of those two lexemes to "blame" for the lack of the space, using frequency statistics.
James's point is that this strategy adds very little entropy to the word prediction decision. Alternate schemes will expand the lexicon with entries like hello. or "Hello. His approach does neither, because you can code the string hello. as either (hello, 1, 0), (., 1, 1) or as (hello, 1, 0), (., 0, 1). This choice is easy: we should definitely "blame" the period for the lack of the space.
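As a rough sketch of that scheme (my reading of the description above, not revtok's actual implementation), each token carries the two affinity bits from has_pre_space/has_space defined earlier, and a space is emitted only when the left token wants a trailing space and the right token wants a leading space:

def encode(doc):
    """Encode a spaCy Doc as (text, wants_pre_space, wants_post_space) triples."""
    return [(tok.text, int(has_pre_space(tok)), int(bool(has_space(tok))))
            for tok in doc]

def decode(triples):
    """Insert a space between two tokens only if both sides have space affinity."""
    out = []
    for i, (text, _pre, post) in enumerate(triples):
        out.append(text)
        if i + 1 < len(triples) and post and triples[i + 1][1]:
            out.append(" ")
    return "".join(out)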
TL;DR
I've written code that attempts to do it; the snippet is below.
Another approach, with a computational complexity of O(n^2), would be to use a function I just wrote.
The main thought was "What spaCy splits, shall be rejoined once more!"
Code:
#!/usr/bin/env python
import spacy
import string

class detokenizer:
    """ This class is an attempt to detokenize a spaCy tokenized sentence """

    def __init__(self, model="en_core_web_sm"):
        self.nlp = spacy.load(model)

    def __call__(self, tokens: list):
        """ Call this method to get a list of detokenized words """
        while self._connect_next_token_pair(tokens):
            pass
        return tokens

    def get_sentence(self, tokens: list) -> str:
        """ Call this method to get the detokenized sentence """
        return " ".join(self(tokens))

    def _connect_next_token_pair(self, tokens: list):
        i = self._find_first_pair(tokens)
        if i == -1:
            return False
        tokens[i] = tokens[i] + tokens[i + 1]
        tokens.pop(i + 1)
        return True

    def _find_first_pair(self, tokens):
        if len(tokens) <= 1:
            return -1
        for i in range(len(tokens) - 1):
            if self._would_spaCy_join(tokens, i):
                return i
        return -1

    def _would_spaCy_join(self, tokens, index):
        """
        Check whether the sum of the lengths of the spaCy-tokenized words equals
        the length of the joined and then spaCy-tokenized words...
        In other words, we say we should join only if the join is reversible.
        e.g.:
        for the text ["The", "man", "."]
        we would join "man" with "."
        but wouldn't join "The" with "man."
        """
        left_part = tokens[index]
        right_part = tokens[index + 1]
        length_before_join = len(self.nlp(left_part)) + len(self.nlp(right_part))
        length_after_join = len(self.nlp(left_part + right_part))
        if self.nlp(left_part)[-1].text in string.punctuation:
            return False
        return length_before_join == length_after_join
Usage:
import spacy
dt = detokenizer()
sentence = "I am the man, who dont dont know. And who won't. be doing"
nlp = spacy.load("en_core_web_sm")
spaCy_tokenized = nlp(sentence)
string_tokens = [a.text for a in spaCy_tokenized]
detokenized_sentence = dt.get_sentence(string_tokens)
list_of_words = dt(string_tokens)
print(sentence)
print(detokenized_sentence)
print(string_tokens)
print(list_of_words)
output:
I am the man, who dont dont know. And who won't. be doing
I am the man, who dont dont know. And who won't . be doing
['I', 'am', 'the', 'man', ',', 'who', 'do', 'nt', 'do', 'nt', 'know', '.', 'And', 'who', 'wo', "n't", '.', 'be', 'doing']
['I', 'am', 'the', 'man,', 'who', 'dont', 'dont', 'know.', 'And', 'who', "won't", '.', 'be', 'doing']
Downsides:
With this approach you may easily merge "do" and "nt", as well as strip the space between a dot "." and the preceding word.
This method is not perfect, as there are multiple possible combinations of sentences that lead to specific spaCy tokenization.
I am not sure if there is a method to fully detokenize a sentence when all you have is spaCy separated text, but this is the best I've got.
After searching on Google for hours, only a few answers came up, with this very Stack question open in 3 of my Chrome tabs ;), and all they basically said was "don't use spaCy, use revtok". As I couldn't change the tokenization other researchers had chosen, I had to develop my own solution. Hope it helps someone ;)

How to logically segment a sentence using spaCy?

I am new to spaCy and trying to segment a sentence logically, so that I can process each part separately. For example:
"If the country selected is 'US', then the zip code should be numeric"
This needs to be broken into:
If the country selected is 'US',
then the zip code should be numeric
Another sentence with commas should not be broken:
The allowed states are NY, NJ and CT
Any ideas or thoughts on how to do this in spaCy?
I am not sure whether we can do this without training the model on custom data, but spaCy allows you to add rules for tokenization, sentence segmentation, etc.
The following code may be useful for this particular case, and you can change the rules according to your requirements.
# Importing spacy and Matcher to merge matched patterns
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

# Defining a pattern: any text surrounded with '' should be merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]

# Adding the pattern to the matcher
matcher.add('special_merger', None, pattern)

# Method to merge matched patterns
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# To determine whether a token can be the start of a sentence.
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Defining a rule such that, if the previous token is "," and the token before that is "'US'",
# then the current token should be the start of a sentence.
def should_be_sentence_start(token):
    if token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'US'":
        return True
    else:
        return False

# Adding the matcher and sentence segmentation to the nlp pipeline.
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Applying NLP on the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output:
If the country selected is 'US',
then the zip code should be numeric
