Concatenate two spacy docs together? - nlp

How do I concatenate two spacy docs together? To merge them into one?
import spacy
nlp = spacy.load('en')
doc1 = nlp(u'This is the doc number one.')
doc2 = nlp(u'And this is the doc number two.')
new_doc = doc1+doc2
Of course that raises an error, since Doc objects cannot be concatenated with + by default. Is there a straightforward way to do this?
I looked at this:
https://github.com/explosion/spaCy/issues/2229
The issue is closed, so it sounds like they have implemented a solution, but I cannot find a simple example of it being used.

What about this:
import spacy
from spacy.tokens import Doc
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(u'This is the doc number one.')
doc2 = nlp(u'And this is the doc number two.')
# Works fine for a few Docs; see the note below about nlp.pipe for many texts
docs=[doc1, doc2]
# `c_doc` is your "merged" doc
c_doc = Doc.from_docs(docs)
print("Merged text: ", c_doc.text)
# Some quick checks: should not trigger any error.
assert len(list(c_doc.sents)) == len(docs)
assert [str(ent) for ent in c_doc.ents] == [str(ent) for doc in docs for ent in doc.ents]
For "a lot" of different sentences, it might be better to use nlp.pipe as shown in the documentation.
Hope it helps.

Related

How to return given word and dependency using spacy

I am experimenting with spaCy for information extraction and would like to return specific tokens, such as the object of a preposition (pobj) together with any compounds.
For the example below I am trying to write code that will return 'radar swivel'.
So far I have tried:
#component/assy
import spacy
# load english language model
nlp = spacy.load('en_core_web_sm', disable=['ner','textcat'])
def component(text):
    doc = nlp(text)
    for token in doc:
        # extract object
        if (token.dep_=='pobj'):
            return(token.text)
        elif (token.dep_=='compound'):
            return(token.text)

df['Component'] = df['Text'].apply(lambda x: component(x))
df.head()
This returns the word 'swivel' but not the preceding compound 'radar'. Is there a way I can rewrite the code to detect the pobj and return it along with any associated compounds? Thanks!
The return statements exit the function as soon as the first matching token is found, so once you hit a pobj (or a compound) you never get to check the remaining tokens for compounds.
To fix that, use the following generator function instead. Once it finds a pobj, it looks at the token's children and collects the ones that are compounds:
def component(text):
    doc = nlp(text)
    for token in doc:
        if (token.dep_=='pobj'):
            compounds = [child.text for child in token.children if child.dep_ == "compound"]
            yield " ".join(compounds) + " " + token.text

df['Component'] = df['Text'].apply(lambda x: list(component(x)))
df.head()
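For example, reusing the function above and the nlp model from the question (the original example text was not included in the question, so this sentence is just a hypothetical stand-in, and the exact output depends on the parser):
text = "Remove the cover from the radar swivel."
print(list(component(text)))
# should print something like ['radar swivel']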

How to merge entities in spaCy via rules

I want to use some of the entities in spaCy 3's en_core_web_lg, but relabel some of what it tags as 'ORG' to 'ANALYTIC', since it treats the 3-character codes I use, such as 'P&L' and 'VaR', as organizations. The model has DATE entities, which I'm fine to preserve. I've read all the docs, and it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I have been through the training 2-3 times now, read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, like needing a decorator, or others. Lord, is it really that hard?
my code:
analytics = [
    [{'LOWER':'risk'}],
    [{'LOWER':'pnl'}],
    [{'LOWER':'p&l'}],
    [{'LOWER':'return'}],
    [{'LOWER':'returns'}]
]
matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)
doc = nlp(text)

# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label for "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")
    # Overwrite the doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)
This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or create my own custom pipeline using rules. This is the suggested text from the entity ruler. No clue.
# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
So when you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can only be in at most one span. In your case you can fix this by modifying this line:
doc.ents = list(doc.ents) + [span]
Here you've included both the old span (which you don't want) and the new span. If you build the list from doc.ents without the old, overlapping span, this will work.
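A minimal sketch of that filtering (the helper name here is made up, not spaCy API):
def add_span_without_conflicts(doc, span):
    # Keep only the existing entities that do not overlap the new span
    kept = [ent for ent in doc.ents if ent.end <= span.start or ent.start >= span.end]
    # Assign the surviving entities plus the new span back to doc.ents
    doc.ents = kept + [span]
Calling add_span_without_conflicts(doc, span) instead of the doc.ents = list(doc.ents) + [span] line above should avoid the conflict.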
There are also other ways to do this. Here I'll use a simplified example where you always want to change items of length 3, but you can modify this to use your list of specific words or something else.
You can directly modify entity labels, like this:
for ent in doc.ents:
    if len(ent.text) == 3:
        ent.label_ = "CHECK"
    print(ent.label_, ent, sep="\t")
If you want to use the EntityRuler it would look like this:
import spacy
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents":True})
patterns = [
    {"label": "ANALYTIC",
     "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]}]
ruler.add_patterns(patterns)

text = "P&L reported amazing returns this year."
doc = nlp(text)

for ent in doc.ents:
    print(ent.label_, ent, sep="\t")
One more thing - you don't say what version of spaCy you're using. I'm using spaCy v3 here. The way pipes are added changed a bit in v3.
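For reference, the difference looks roughly like this (a sketch, assuming an existing nlp object):
# spaCy v2 style: construct the component yourself and add the instance
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
nlp.add_pipe(ruler)

# spaCy v3 style: add the component by its registered string name
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})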

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have Python 3.7.3, spaCy 2.2.3, and Rasa 1.6.1.
Does someone know how to fix this issue?
That sounds like a naming mismatch. I guess you applied a matcher built on one text (or vocab) to another one, so the match_id hash can no longer be resolved and spaCy gets confused.
To solve it, make sure you use the same matcher on the same text, like below.
Perform the standard imports, reset nlp, and import the PhraseMatcher library:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one we built for this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

POS tagging a single word in spaCy

The spaCy POS tagger is usually used on entire sentences. Is there a way to efficiently apply unigram POS tagging to a single word (or a list of single words)?
Something like this:
words = ["apple", "eat", good"]
tags = get_tags(words)
print(tags)
> ["NNP", "VB", "JJ"]
Thanks.
English unigrams are often hard to tag well, so think about why you want to do this and what you expect the output to be. (Why is the POS of 'apple' in your example NNP? What's the POS of 'can'?)
spacy isn't really intended for this kind of task, but if you want to use spacy, one efficient way to do it is:
import spacy
nlp = spacy.load('en')
# disable everything except the tagger
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
nlp.disable_pipes(*other_pipes)
# use nlp.pipe() instead of nlp() to process multiple texts more efficiently
for doc in nlp.pipe(words):
    if len(doc) > 0:
        print(doc[0].text, doc[0].tag_)
See the documentation for nlp.pipe(): https://spacy.io/api/language#pipe
You can do something like this:
import spacy
nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]
for word in word_list:
    doc = nlp(word)
    print(doc[0].text, doc[0].pos_)
Alternatively, you can do:
import spacy

nlp = spacy.load("en_core_web_sm")
word_list = ["apple", "eat", "good"]

# Build a Doc directly from the word list and run each pipeline component on it
doc = spacy.tokens.doc.Doc(nlp.vocab, words=word_list)
for name, proc in nlp.pipeline:
    doc = proc(doc)

pos_tags = [x.pos_ for x in doc]
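To check the result you could, for example, pair each word with its tag (the exact tags depend on the model):
print(list(zip(word_list, pos_tags)))
# e.g. [('apple', 'NOUN'), ('eat', 'VERB'), ('good', 'ADJ')]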

How do I use generator objects in spaCy?

First experience with NLP here. I have about half a million tweets. I'm trying to use spaCy to remove stop words, lemmatize, etc. and then pass the processed text to a classification model. Because of the size of the data I need multiprocessing to do this at a reasonable speed, but I can't figure out what to do with the generator object once I have it.
Here I load spacy and pass the data through the standard pipeline:
nlp = spacy.load('en')
tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']
spacy_tweets = []
for tweet in tweets:
    doc_tweet = nlp.pipe(tweet, batch_size = 10, n_threads = 3)
    spacy_tweets.append(doc_tweet)
Now I'd like to take the Doc objects spaCy creates and then process the text with something like this:
def spacy_tokenizer(tweet):
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in punctuations)]
    return tweet
But this doesn't work because spaCy returns generator objects when using the .pipe() method. So when I do this:
for tweet in spacy_tweets:
    print(tweet)
It prints the generator. Okay, I get that. But when I do this:
for tweet in spacy_tweets[0]:
    print(tweet)
I would expect it to print the Doc object or the text of the tweet in the generator, but it doesn't do that. Instead it prints each character out individually.
Am I approaching this wrong or is there something I need to do in order to retrieve the Doc objects from the generator objects so I can use the spaCy attributes for lemmatizing etc.?
I think you are using the nlp.pipe command incorrectly.
nlp.pipe is meant for batch processing: it processes several tweets at once. So, instead of giving nlp.pipe a single tweet as an argument, you should pass it the whole list of tweets.
The following code seems to achieve your goal:
import spacy
nlp = spacy.load('en')
tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']
spacy_tweets = nlp.pipe(tweets, batch_size = 10, n_threads = 3)

for tweet in spacy_tweets:
    for token in tweet:
        print(token.text, token.pos_)
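If you then want to run something like your spacy_tokenizer over each Doc, a rough sketch could look like this (your own stopwords and punctuations lists are assumed; here spaCy's stop word list and string.punctuation stand in for them so the example runs on its own):
from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS

stopwords = STOP_WORDS
punctuations = set(punctuation)

def spacy_tokenizer(tweet_doc):
    # Lemmatize and lowercase each token, keeping pronouns as-is
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet_doc]
    # Drop stop words and punctuation
    return [tok for tok in tokens if tok not in stopwords and tok not in punctuations]

processed_tweets = [spacy_tokenizer(doc) for doc in nlp.pipe(tweets, batch_size=10, n_threads=3)]
print(processed_tweets)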
Hope it helps!
