I am using spaCy to find locations in a string - python-3.x

I wrote some code to find the locations present in a string.
import spacy
nlp = spacy.load('en')
doc1='Pune, India'
doc2='India, Pune'
doc3='Pune India'
doc4='India Pune'
print([(X.text, X.label_) for X in nlp(doc1).ents])
print([(X.text, X.label_) for X in nlp(doc2).ents])
print([(X.text, X.label_) for X in nlp(doc3).ents])
print([(X.text, X.label_) for X in nlp(doc4).ents])
and my output is:
[('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
[('India', 'GPE')]
[('India', 'GPE')]
How can I get the same output, [('India', 'GPE'), ('Pune', 'GPE')], for all of them?

spaCy is able to work out that 'Pune' is probably a GPE in doc2 only because of the comma (,) after 'India'. That isn't the case for the other three examples, so it only detects the GPE it already knows, 'India'.
You may have to write some custom code based on the cases in your dataset. One option is to look at the words immediately before and after the detected GPEs and check whether they appear in an English dictionary. For example, in the cases where only 'India' is detected, you could look up 'Pune' in an English word list (you can get one via spaCy), and if it isn't there, treat it as a candidate GPE.
Do let me know if you need help coding this.
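For illustration, here is a minimal sketch of that idea. The candidate_gpes helper is my own, and it uses NLTK's words corpus as the English dictionary (an assumption; the suggestion above mentions getting a word list from spaCy instead). It checks the tokens immediately before and after each detected GPE and treats out-of-dictionary neighbours as candidate GPEs:
import spacy
import nltk
nltk.download('words', quiet=True)  # one-time download of the English word list
from nltk.corpus import words

english_words = set(w.lower() for w in words.words())
nlp = spacy.load('en')

def candidate_gpes(text):
    doc = nlp(text)
    candidates = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
    for ent in doc.ents:
        if ent.label_ != 'GPE':
            continue
        # look at the token just before and just after the detected entity
        for i in (ent.start - 1, ent.end):
            if 0 <= i < len(doc):
                tok = doc[i]
                # an alphabetic token that is not a common English word is a candidate place name
                if tok.is_alpha and tok.text.lower() not in english_words and tok.text not in candidates:
                    candidates.append(tok.text)
    return candidates

print(candidate_gpes('Pune India'))  # may print ['India', 'Pune'] if 'Pune' is not in the word list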

This is because the model you are using does not know that 'Pune' is an entity with label 'GPE'.
You can add an extra component to your pipeline with add_pipe() to identify 'Pune' as an entity as well.
Note that the presence of a comma is not the reason why 'Pune' is classified as an entity.
Since you have several texts, let's keep them in a list so that nlp.pipe() can be used.
texts = ['Pune, India', 'India, Pune', 'Pune India', 'India Pune']
LOCATIONS = ["Pune", "India"]
First, you have to find the places where 'Pune' occurs in the text. PhraseMatcher can be used for that.
from spacy.matcher import PhraseMatcher
Create a matcher object and add the patterns to be recognized to it.
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LOCATIONS", None, *list(nlp.pipe(LOCATIONS)))
Now define the component function.
def places_component(doc):
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]
    return doc
Add the component to the pipeline.
nlp.add_pipe(places_component)  # last=True
Now try
for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
You'll get
[('Pune', 'GPE'), ('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
[('Pune', 'GPE'), ('India', 'GPE')]
[('India', 'GPE'), ('Pune', 'GPE')]
The whole program would be something like
import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
LOCATIONS = ["Pune", "India"]
texts = ['Pune, India', 'India, Pune', 'Pune India', 'India Pune']
nlp = spacy.load('en')
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LOCATIONS", None, *list(nlp.pipe(LOCATIONS)))
def places_component(doc):
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]
    return doc

nlp.add_pipe(places_component)  # last=True

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])
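Note that the code above uses the spaCy v2 API (the 'en' shortcut, matcher.add with a None callback, add_pipe with a function). If you are on spaCy v3, a roughly equivalent sketch (assuming the en_core_web_sm model is installed) would be:
import spacy
from spacy.language import Language
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher

LOCATIONS = ["Pune", "India"]
texts = ['Pune, India', 'India, Pune', 'Pune India', 'India Pune']

nlp = spacy.load('en_core_web_sm')
matcher = PhraseMatcher(nlp.vocab)
matcher.add("LOCATIONS", [nlp.make_doc(loc) for loc in LOCATIONS])  # v3: no None callback argument

@Language.component("places_component")  # v3: components are registered by name
def places_component(doc):
    doc.ents = [Span(doc, start, end, label="GPE") for match_id, start, end in matcher(doc)]
    return doc

nlp.add_pipe("places_component", last=True)  # v3: add the component by its registered name

for doc in nlp.pipe(texts):
    print([(ent.text, ent.label_) for ent in doc.ents])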

Related

How to merge entities in spaCy via rules

I want to use some of the entities in spaCy 3's en_core_web_lg, but relabel some of what it tags as 'ORG' to 'ANALYTIC', since it treats the 3-character codes I use, such as 'P&L' and 'VaR', as organizations. The model has DATE entities, which I'm fine to preserve. I've read all the docs, and it seems like I should be able to use the EntityRuler with the syntax below, but I'm not getting anywhere. I have been through the training 2-3x now, read all the Usage and API docs, and I just don't see any examples of working code. I get all sorts of different error messages, like I need a decorator, or other. Lord, is it really that hard?
my code:
analytics = [
    [{'LOWER': 'risk'}],
    [{'LOWER': 'pnl'}],
    [{'LOWER': 'p&l'}],
    [{'LOWER': 'return'}],
    [{'LOWER': 'returns'}]
]
matcher = Matcher(nlp.vocab)
matcher.add("ANALYTICS", analytics)
doc = nlp(text)
# Iterate over the matches
for match_id, start, end in matcher(doc):
    # Create a Span with the label "ANALYTIC"
    span = Span(doc, start, end, label="ANALYTIC")
    # Overwrite doc.ents and add the span
    doc.ents = list(doc.ents) + [span]
    # Get the span's root head token
    span_root_head = span.root.head
    # Print the text of the span root's head token and the span text
    print(span_root_head.text, "-->", span.text)
This of course crashes when my new 'ANALYTIC' entity span collides with the existing 'ORG' one. But I have no idea how to either merge these offline and put them back, or how to create my own custom pipeline using rules. This is the suggested snippet from the EntityRuler docs. No clue.
# Construction via add_pipe
ruler = nlp.add_pipe("entity_ruler")
# Construction from class
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp, overwrite_ents=True)
So when you say it "crashes", what's happening is that you have conflicting spans. For doc.ents specifically, each token can only be in at most one span. In your case you can fix this by modifying this line:
doc.ents = list(doc.ents) + [span]
Here you've included both the old span (that you don't want) and the new span. If you get doc.ents without the old span this will work.
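For instance, a minimal sketch of that fix inside the loop from the question (using the same start, end and doc variables as above):
# create the new span, then drop any existing entity that overlaps it before assigning doc.ents
span = Span(doc, start, end, label="ANALYTIC")
kept = [e for e in doc.ents if e.end <= span.start or e.start >= span.end]
doc.ents = kept + [span]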
There are also other ways to do this. Here I'll use a simplified example where you always want to change items of length 3, but you can modify this to use your list of specific words or something else.
You can directly modify entity labels, like this:
for ent in doc.ents:
    if len(ent.text) == 3:
        ent.label_ = "CHECK"
    print(ent.label_, ent, sep="\t")
If you want to use the EntityRuler it would look like this:
import spacy
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents":True})
patterns = [
    {"label": "ANALYTIC",
     "pattern": [{"ENT_TYPE": "ORG", "LENGTH": 3}]}
]
ruler.add_patterns(patterns)
text = "P&L reported amazing returns this year."
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent, sep="\t")
One more thing - you don't say what version of spaCy you're using. I'm using spaCy v3 here. The way pipes are added changed a bit in v3.

Unexpected lemmatize result from gensim

I used the following code to lemmatize texts that already had stop words removed and kept only words longer than 3 characters. However, it split existing words such as 'wheres' into ['where', 's'] and 'youre' into ['-PRON-', 'be']. I didn't expect 's', '-PRON-', or 'be' in my results. What caused this behaviour and what can I do about it?
import spacy

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc])  # though rare, to keep only tokens with the given POS tags, add 'if token.pos_ in allowed_postags'
    return texts_out

# Initialize the spaCy 'en' model, keeping only the tagger component (for efficiency)
nlp = spacy.load('en', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have Python 3.7.3, spaCy 2.2.3 and Rasa 1.6.1.
Does someone know how to fix this issue?
That sounds like a naming mistake: I guess you applied a matcher built on one text (and vocab) to another one, so the match_id refers to a string the other StringStore doesn't know and spaCy gets confused.
To solve it, make sure that you use the same matcher on the same text, like below.
Perform the standard imports and set up nlp and the PhraseMatcher:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one that we built for this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)
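To illustrate the failure mode described above, here is a small sketch (the two pipelines and variable names are purely for demonstration): looking up a match_id in a vocab whose StringStore never saw the matcher key raises exactly this E018 error.
import spacy
from spacy.matcher import PhraseMatcher

nlp_a = spacy.load('en_core_web_sm')
nlp_b = spacy.load('en_core_web_sm')  # a second, independent pipeline with its own StringStore

matcher = PhraseMatcher(nlp_a.vocab)  # matcher is tied to nlp_a's vocab
matcher.add('VoodooEconomics', None, nlp_a('voodoo economics'))

doc = nlp_a('refers to the economic policies associated with supply-side economics, voodoo economics')
match_id, start, end = matcher(doc)[0]

print(nlp_a.vocab.strings[match_id])  # fine: 'VoodooEconomics'
print(nlp_b.vocab.strings[match_id])  # KeyError E018: the hash is unknown to this StringStore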

Replace entity with its label in SpaCy

Is there any way in spaCy to replace an entity detected by spaCy NER with its label?
For example:
I am eating an apple while playing with my Apple Macbook.
I have trained NER model with SpaCy to detect "FRUITS" entity and the model successfully detects the first "apple" as "FRUITS", but not the second "Apple".
I want to do post-processing of my data by replacing each entity with its label, so I want to replace the first "apple" with "FRUITS". The sentence will be "I am eating an FRUITS while playing with my Apple Macbook."
If I simply use regex, it will replace the second "Apple" with "FRUITS" as well, which is incorrect. Is there any smart way to do this?
Thanks!
the entity label is an attribute of the token (see here)
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
s = "His friend Nicolas is here."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here .
Edit:
In order to handle cases where entities can span several words, the following code can be used instead:
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
newString = s
for e in reversed(doc.ents):  # reversed to not modify the offsets of other entities when substituting
    start = e.start_char
    end = start + len(e.text)
    newString = newString[:start] + e.label_ + newString[end:]
print(newString)
# His friend PERSON is here with PERSON and PERSON.
Update:
Jinhua Wang brought to my attention that there is now a more built-in and simpler way to do this using the merge_entities pipe.
See Jinhua's answer below.
A more elegant modification to @DBaker's solution above when entities can span several words:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe("merge_entities")
s = "His friend Nicolas J. Smith is here with Bart Simpon and Fred."
doc = nlp(s)
print([t.text if not t.ent_type_ else t.ent_type_ for t in doc])
# ['His', 'friend', 'PERSON', 'is', 'here', 'with', 'PERSON', 'and', 'PERSON', '.']
print(" ".join([t.text if not t.ent_type_ else t.ent_type_ for t in doc]) )
# His friend PERSON is here with PERSON and PERSON .
You can check the spaCy documentation here. It uses the built-in pipeline for the job and has good support for multiprocessing. I believe this is the officially supported way to replace entities with their labels.
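As a quick sketch of that batching support (the sample texts and the n_process value are just for illustration; this assumes the merge_entities pipe added above):
texts = ["His friend Nicolas J. Smith is here with Bart Simpon and Fred."] * 3
# nlp.pipe streams documents and can spread the work over several processes
for doc in nlp.pipe(texts, n_process=2):
    print(" ".join(t.ent_type_ if t.ent_type_ else t.text for t in doc))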
A slightly shorter version of @DBaker's answer, which uses end_char instead of computing it:
for ent in reversed(doc.ents):
    text = text[:ent.start_char] + ent.label_ + text[ent.end_char:]

Lemmatization of words using spacy and nltk not giving correct lemma

I want to get the lemmatized forms of the words in the list given below, e.g.
words = ['Funnier','Funniest','mightiest','tighter']
When I use spaCy:
import spacy
nlp = spacy.load('en')
words = ['Funnier','Funniest','mightiest','tighter','biggify']
doc = spacy.tokens.Doc(nlp.vocab, words=words)
for items in doc:
    print(items.lemma_)
I got the lemmas like:
Funnier
Funniest
mighty
tight
When I use NLTK's WordNetLemmatizer:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token))
I got:
Funnier : Funnier
Funniest : Funniest
mightiest : mightiest
tighter : tighter
Can anyone help with this?
Thanks.
Lemmatisation depends entirely on the part-of-speech tag that you use when getting the lemma of a particular word.
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Define the sentence to be lemmatized
sentence = "The striped bats are hanging on their feet for best"
# Tokenize: split the sentence into words
word_list = nltk.word_tokenize(sentence)
print(word_list)
#> ['The', 'striped', 'bats', 'are', 'hanging', 'on', 'their', 'feet', 'for', 'best']
# Lemmatize list of words and join
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
print(lemmatized_output)
#> The striped bat are hanging on their foot for best
The above code is a simple example of how to use the WordNet lemmatizer on words and sentences.
Notice it didn't do a good job: 'are' is not converted to 'be' and 'hanging' is not converted to 'hang' as expected. This can be corrected if we provide the correct part-of-speech (POS) tag as the second argument to lemmatize().
Sometimes the same word can have multiple lemmas depending on the meaning/context.
print(lemmatizer.lemmatize("stripes", 'v'))
#> strip
print(lemmatizer.lemmatize("stripes", 'n'))
#> stripe
For the example above, specify the corresponding POS tag:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet  # provides the wordnet.ADJ_SAT constant

lemmatizer = WordNetLemmatizer()
words = ['Funnier','Funniest','mightiest','tighter','biggify']
for token in words:
    print(token + ' --> ' + lemmatizer.lemmatize(token, wordnet.ADJ_SAT))
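If you would rather not hard-code the POS tag, here is a minimal sketch of deriving it with NLTK's pos_tag and mapping the Treebank tag to a WordNet constant. The wordnet_pos helper is my own, and tagging words in isolation is only a rough heuristic:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

# one-time downloads of the standard NLTK resources
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)

def wordnet_pos(treebank_tag):
    # map a Penn Treebank tag to a WordNet POS constant; default to noun
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
words = ['Funnier', 'Funniest', 'mightiest', 'tighter', 'biggify']
for token, tag in nltk.pos_tag(words):
    print(token, '-->', lemmatizer.lemmatize(token.lower(), wordnet_pos(tag)))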
