In spaCy, is it possible to get the corresponding rule ID for a match in matches?

In spaCy 2.x, I use the Matcher to find specific tokens in my text corpus. Each rule has an ID ('class-1_0', for example). During parsing, I use the on_match callback to handle each match. Is there a way to retrieve, directly in the callback, the rule that produced the match?
Here is my sample code.
import spacy
from spacy.matcher import Matcher

txt = ("Aujourd'hui, je vais me faire une tartine au beurre "
       "de cacahuète, c'est un pilier de ma nourriture "
       "quotidienne.")
nlp = spacy.load('fr')

def on_match(matcher, doc, id, matches):
    span = doc[matches[id][1]:matches[id][2]]
    print(span)
    # find a way to get the corresponding rule without fuzz

matcher = Matcher(nlp.vocab)
matcher.add('class-1_0', on_match, [{'LEMMA': 'pilier'}])
matcher.add('class-1_1', on_match, [{'LEMMA': 'beurre'}, {'LEMMA': 'de'}, {'LEMMA': 'cacahuète'}])
doc = nlp(txt)
matches = matcher(doc)
In this case, matches returns:
[(12071893341338447867, 9, 12), (4566231695725171773, 16, 17)]
12071893341338447867 is a unique ID derived from 'class-1_0'. I cannot find the original rule name, even after doing some introspection in matcher._patterns.
It would be great if someone can help me.
Thank you very much.

Yes – you can simply look up the ID in the StringStore of your vocabulary, available via nlp.vocab.strings or doc.vocab.strings. Going via the Doc is pretty convenient here, because you can do so within your on_match callback:
def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    string_id = doc.vocab.strings[match_id]
For efficiency, spaCy encodes all strings to integers and keeps a reference to the mapping in the StringStore lookup table. In spaCy v2.0, the integers are hash values, so they'll always match across models and vocabularies. For more details on this, see this section in the docs.
Of course, if your classes and IDs are kinda cryptic anyways, the other answer suggesting integer IDs will work fine, too. Just keep in mind that those integer IDs you choose will likely also be mapped to some random string in the StringStore (like a word, or a part-of-speech tag or something). This usually doesn't matter if you're not looking them up and resolving them to strings somewhere – but if you do, the output may be confusing. For example, if your matcher rule ID is 99 and you're calling doc.vocab.strings[99], this will return 'VERB'.
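For example, outside the callback you can resolve the rule names the same way when looping over the returned matches (a minimal sketch, assuming the matcher and doc set up in the question):
for match_id, start, end in matcher(doc):
    rule_id = nlp.vocab.strings[match_id]  # e.g. 'class-1_0'
    print(rule_id, doc[start:end])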

While writing my question, as often happens, I found the solution.
It's dead simple: instead of using a unicode rule ID like 'class-1_0', simply use an integer. The identifier is then preserved throughout the process.
matcher.add(1, on_match, [{'LEMMA': 'pilier'}])
The match then comes back as:
[(1, 16, 17),]

Related

Is there a way to only list a certain format of text from a list?

I am quite new to Python.
I want to get only items of a certain format from a bigger list, for example:
What's in the list:
/ABC/EF213
/ABC/EF
/ABC/12AC4
/ABC/212
However, the only ones I want listed are those with the format /###/#####, while the rest get discarded.
You could use a list comprehension or a for loop to check whether each element of the list matches a pattern. One way of doing this is to test each item against a regex.
As an example:
import re
original_list = ["Item I don't want", "/ABC/EF213", "/ABC/EF", "/ABC/12AC4", "/ABC/212", "123/456", "another useless item", "/ABC/EF"]
filtered_list = [item for item in original_list if re.fullmatch(r"/\w+/\w+", item) is not None]
print(filtered_list)
outputs
['/ABC/EF213', '/ABC/EF', '/ABC/12AC4', '/ABC/212', '/ABC/EF']
If you need help building regex patterns, there are many great websites, such as regexr, which can help you.
Every string can be indexed like a list without any conversion. If the only format you want to check is /###/#####, then you could simply write if statements like these:
for text in your_list:
    if len(text) == 10 and text[0] == "/" and text[4] == "/":  # ...and so on
        print(text)
Of course this would require a lot of if conditions and take quite a while to write, so I would recommend a faster and simpler check. We could do this by, for example, splitting the text, which would look something like this:
for text in your_list:
    checkstring = text.split("/")
Now you have your text split into parts, and you can simply check what lengths these new parts have with len().
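For completeness, here is a minimal sketch of what that length check could look like (your_list and the sample items are assumptions; it mirrors the /something/something shape checked by the regex answer above):
your_list = ["/ABC/EF213", "/ABC/EF", "not/wanted", "plain text"]
for text in your_list:
    parts = text.split("/")
    # a matching item splits into ['', first_part, second_part]
    if len(parts) == 3 and parts[0] == "" and parts[1] and parts[2]:
        print(text)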

How do I create a search using NLP techniques which searches an inputted named entity as well as any potential name variations it may have?

I'm currently using TextBlob to make a chatbot, and so far I've been extracting named entities using noun phrase extraction and the POS tag NNP. When entering a test user question such as 'Will Smith's latest single?', I correctly retrieve 'Will Smith'. But I want to be able to search not only 'will smith' but also 'william smith', 'bill smith', 'willie smith', 'billy smith' - basically other popularly known variations of the name in the English language. I am using the Spotipy API as I am trying to retrieve Spotify artists. What I'm currently doing in PyCharm:
from textblob import TextBlob

# spotifyObject is a spotipy.Spotify client created elsewhere
while True:
    response = input()
    searchQuery = TextBlob(response)
    who = []
    for item, tag in searchQuery.tags:
        if tag == "NNP":
            for nounPhrase in searchQuery.noun_phrases:
                np = TextBlob(nounPhrase)
                if item.lower() in np.words:
                    if nounPhrase not in who:
                        who.append(nounPhrase)
    print(who)
    if who:
        for name in who:
            if spotifyObject.search(name, 50, 0, 'artist', None):
                searchResults = spotifyObject.search(name, 50, 0, 'artist', None)
                artists = searchResults['artists']['items']
                for a in artists:
                    print(a['name'])
Quick question:
Why would you want 'Bill Smith' to appear under the same search for Will Smith?
I believe they are 2 different artists.
Option 1
If I understand your question correctly, I believe you may want to use regular expressions on the first name of the artist.
For example: name LIKE %(any first name)% + smith
I assume the search is invalid in your case if it returns "Will Sutton", for example.
Option 2
You may want something similar to spaCy's sense2vec feature, which returns words with a similarity percentage. You could set a threshold so that only results above 70%, for example, are returned.
https://explosion.ai/demos/sense2vec
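As a rough illustration of the threshold idea, here is a sketch using spaCy's built-in word vectors rather than sense2vec itself (the model name, candidate list, and 0.7 cutoff are assumptions, not part of the original answer):
import spacy

nlp = spacy.load('en_core_web_md')  # a model that ships with word vectors

query = nlp("will smith")
candidates = ["Will Smith", "Willie Smith", "Bill Smith", "Will Sutton"]

for name in candidates:
    score = query.similarity(nlp(name.lower()))
    if score > 0.7:  # keep only sufficiently similar names
        print(name, round(score, 2))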
If this is not useful, please explain your question again in a bit more detail (such as what makes a valid search case).
Thanks

Using regular expressions, isolate the words with ei or ie in them

How do I use regular expressions to isolate the words with ei or ie in them?
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
list = re.findall("[ei,ie]\w+", value)
print(list)
It should print ['receive', 'achieves', 'believing', 'deceived'], but I get ['eceive', 'er', 'ieves', 'ile', 'elieving', 'eceived'] instead.
The set syntax [] matches individual characters ([ei,ie] matches a single e, i, or comma, not the sequences ei or ie), so use a non-capturing group (?:) instead, with the alternatives separated by |. This is like using a group, but it doesn't capture a match group like () would. You also want \w on either side to be captured so you get the whole word.
import re
value = ("How can one receive one who over achieves while believing that he/she cannot be deceived.")
list = re.findall(r"(\w*(?:ei|ie)\w*)", value)
print(list)
['receive', 'achieves', 'believing', 'deceived']
(I'm assuming you meant "achieves", not "achieve" since that's the word that actually appears here.)

Finding close string matches - valuing substring word matches higher

I'm trying to find close string matches (context - searching for a discord user from user input).
At the moment, I'm trying out difflib. It works OK, but sometimes returns odd results. E.g. if someone's name contains a word, searching for that word may return something that seems far off instead of that name.
I think that's just because of how get_close_matches works. Could someone suggest other libraries to try? (I'm not sure how to quantify what I'm after, but I probably want a scorer that gives a higher score to names containing a word similar to the search term.)
import difflib

user_names = []
for member in server.members:
    if member.name is not None: user_names.append(member.name)
    if member.nick is not None: user_names.append(member.nick)

user_name = difflib.get_close_matches(user_msg, user_names, n=1, cutoff=0.2)
I've used https://github.com/seatgeek/fuzzywuzzy for this in the past; it provides a few scoring options out of the box, from plain ratios on single words to tokenizing and sorting larger strings.
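A small sketch of how that could look for this use case (the example names, query, and choice of token_sort_ratio are assumptions, not part of the original answer):
from fuzzywuzzy import fuzz, process

user_names = ["CoolGamer42", "smith", "will smith fan", "xX_Shadow_Xx"]
query = "smith"

# extractOne returns the best (name, score) pair according to the chosen scorer;
# token_sort_ratio scores names that contain the searched word relatively highly.
best = process.extractOne(query, user_names, scorer=fuzz.token_sort_ratio)
print(best)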

What is Natural Language Processing Doing Exactly in This Code?

I am new to natural language processing and I want to use it to write a news aggregator (in Node.js in my case). Rather than just using a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk

corpus = []
titles = []
ct = -1

for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetical or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace; this matches any run of punctuation
Here is a reference for Python regular expressions.
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function finds the first match of the regular expression in a string, then the next, and so on.
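To make that concrete, here is a quick sketch of what that regex yields on a made-up title (using re.findall directly rather than the NLTK wrapper):
import re

pattern = r'\w+|[^\w\s]+'
print(re.findall(pattern, "U.S. stocks rally, Dow up 1.5%"))
# ['U', '.', 'S', '.', 'stocks', 'rally', ',', 'Dow', 'up', '1', '.', '5', '%']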
