NLP: spaCy custom rule-based matching

I am working with spaCy and need to extract information such as email addresses and phone numbers from text. Below is my code; however, there is something I am doing wrong in the matcher, because I am not getting the desired output.
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
my_pattern = [{"LOWER": "email"}, {"LOWER": "phone"}]
matcher.add('MyPattern', [my_pattern])
my_text = "email: kashif.jilani@sample.com, phone: 1234567"
my_doc = nlp(my_text)
desired_matches = matcher(my_doc)
for match_id, start, end in desired_matches:
    string_id = nlp.vocab.strings[match_id]
    span = my_doc[start:end]
    print(span.text)

First of all, you have a problem with the formatting of the patterns: the argument must be a list of patterns, and each pattern is a list of dicts.
Following your current patterns, you need to change this:
my_pattern = [{"LOWER": "email"}, {"LOWER": "phone"}]
to this:
my_pattern = [[{"LOWER": "email"}], [{"LOWER": "phone"}]]
But I believe you have another problem: you said in the original post that you want to extract info such as the email address and phone number, but your current pattern only extracts the words "email" and "phone". However, you can use the spaCy token matcher to automatically extract this info using the following patterns:
my_pattern = [[{'LIKE_EMAIL': True}], [{'LIKE_NUM': True}]]
Now, if you change that line, your code will look like:
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
my_pattern = [[{'LIKE_EMAIL': True}], [{'LIKE_NUM': True}]]
matcher.add('MyPattern', my_pattern)
my_text = "email: kashif.jilani@sample.com, phone: 1234567"
my_doc = nlp(my_text)
desired_matches = matcher(my_doc)
for match_id, start, end in desired_matches:
    string_id = nlp.vocab.strings[match_id]
    span = my_doc[start:end]
    print(span.text)

# output:
# kashif.jilani@sample.com
# 1234567
You can learn more about rule-based matching in the spaCy docs: https://spacy.io/usage/rule-based-matching
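One caveat: LIKE_NUM matches any number-like token, so if your phone numbers span several tokens (for example 123-456-7890), you can spell out the token shapes instead. A rough sketch, where the SHAPE values are only an assumption about the phone format and the sample sentence is made up:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Hypothetical pattern for numbers written as ddd-ddd-dddd;
# adjust the SHAPE values to whatever format appears in your data.
phone_pattern = [[{"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "ddd"}, {"ORTH": "-"}, {"SHAPE": "dddd"}]]
matcher.add("PHONE", phone_pattern)

doc = nlp("You can reach me at 123-456-7890.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # 123-456-7890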

Related

How to check if any word in a string starts with any of the words in a list using Python

I am trying to filter sentences from my pandas DataFrame (about 50 million records) using keyword search: a sentence should be kept if any word in it starts with any of these keywords.
WordsToCheck=['hi','she', 'can']
text_string1="my name is handhit and cannary"
text_string2="she can play!"
If I do something like this:
if any(key in text_string1 for key in WordsToCheck):
    print(text_string1)
I get a false positive because 'hi' occurs inside 'handhit', even though no word starts with it.
How can I smartly avoid all such false positives in my result set?
Secondly, is there a faster way to do this in Python? I am currently using the apply function.
I am following this link so that my question is not a duplicate: How to check if a string contains an element from a list in Python
If case is important, you can do something like this:
def any_word_starts_with_one_of(sentence, keywords):
    for kw in keywords:
        match_words = [word for word in sentence.split(" ") if word.startswith(kw)]
        if match_words:
            return kw
    return None

keywords = ["hi", "she", "can"]
sentences = ["Hi, this is the first sentence", "This is the second"]
for sentence in sentences:
    if any_word_starts_with_one_of(sentence, keywords):
        print(sentence)
If case is not important, replace line 3 with something like this:
match_words = [word for word in sentence.split(" ") if word.lower().startswith(kw.lower())]
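Regarding your second question about speed: since the check is just "does any word start with one of the keywords", you can let pandas do it in one vectorized pass with a word-boundary regex instead of apply. A rough sketch, assuming your DataFrame is called df and the column is named sentence (both names are placeholders, not from your post):

import re
import pandas as pd

df = pd.DataFrame({"sentence": ["my name is handhit and cannary", "she can play!"]})

keywords = ["hi", "she", "can"]
# \b anchors each keyword to the start of a word, so "handhit" no longer matches "hi"
pattern = r"\b(?:" + "|".join(map(re.escape, keywords)) + r")"

mask = df["sentence"].str.contains(pattern, case=False, regex=True)
print(df[mask])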

Create a list every time I encounter a certain word in a str

My problem is that I want to write code that does this:
input => str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
output => post30 = ["blue","yellow"]
post2 = ["sky","earth"]
post5 = ["summer", "winter"]
At first I thought I could do something like
if "<post>" in str_of_words:
occurrence = str_of_words.count("<post>")
#and from there I had no idea how to continue coding it
So I thought I would ask if anyone knows some tricks for doing this.
You can use the nltk module:
import re
import nltk
nltk.download('words')
from nltk.corpus import words
def split(a):
    for i in range(len(a)):
        if a[:i] in words.words() and a[i:] in words.words():
            return [a[:i], a[i:]]
str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
post = {i:split(j) for i,j in dict(re.findall(r'post>(\d+)(\w+)',str_of_words)).items()}
>>> post['30']
['blue', 'yellow']
>>> post['5']
['summer', 'winter']
>>> post['2']
['sky', 'earth']
this might get you started:
import re
str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
posts = {}
lst = str_of_words.split('<post>')
for item in lst:
    match = re.match(r'(\d+)(\D+)', item)
    if not match:
        continue
    posts[int(match.group(1))] = match.group(2)
print(posts)
it prints:
{30: 'blueyellow', 2: 'skyearth', 5: 'summerwinter'}
So posts[30] = 'blueyellow'.
The re module is very helpful when it comes to separating numbers (\d) from non-numbers (\D).
I don't know by what rules you would like to split the words; do you have a list of words that could appear?
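If you do have a fixed vocabulary, you can combine the regex above with the splitting idea from the first answer without pulling in nltk: try every split point against your own word list. A rough sketch with a made-up known_words set:

import re

known_words = {"blue", "yellow", "sky", "earth", "summer", "winter"}  # assumed vocabulary

def split_known(text):
    # try every split point; return the first one where both halves are known words
    for i in range(1, len(text)):
        if text[:i] in known_words and text[i:] in known_words:
            return [text[:i], text[i:]]
    return [text]

str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
posts = {int(num): split_known(rest)
         for num, rest in re.findall(r'<post>(\d+)(\w+)', str_of_words)}
print(posts)  # {30: ['blue', 'yellow'], 2: ['sky', 'earth'], 5: ['summer', 'winter']}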

Get string that was matched by regex?

I got this code for a reddit bot:
match = re.findall(r"(?i)\bword1\b|\bword2\b|\bword3\b", comment.body)
which matches several words. How can I print which word was matched?
Look at this example; it may help you:
import re
f=open('sample.txt',"w")
f.write("<p class = m>babygameover</p>")
f.close()
f=open('sample.txt','r')
string = "<p class = m>(.+?)</p>"
pattern = re.compile(string)
text = f.read()
search = re.findall(pattern,text)
print search
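For the original question, though, re.findall already returns the matched words themselves, so you can print them directly; re.finditer gives you match objects if you also need positions. A minimal sketch, with a made-up comment body:

import re

comment_body = "word2 shows up here, and so does WORD3"  # placeholder text
matches = re.findall(r"(?i)\bword1\b|\bword2\b|\bword3\b", comment_body)
print(matches)  # ['word2', 'WORD3']

# if you also want to know where each word occurred:
for m in re.finditer(r"(?i)\bword1\b|\bword2\b|\bword3\b", comment_body):
    print(m.group(0), m.start())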

How to find a synonyms in a list of strings using NLTK synsets?

I have a list of strings:
words1 = ['feds', 'move', 'to', 'require', 'cartocar', 'safety', 'communication']
I want to find synsets for each of those words using NLTK WordNet synsets. First, I try it with just one string from my list.
Here's my code:
from nltk.corpus import wordnet as wn
word = ['feds']
data1 = ' '.join(word)
def getSynonyms(data1):
    synonymList1 = []
    wordnetSynset1 = wn.synsets(data1)
    for synset1 in wordnetSynset1:
        for synWords1 in synset1.lemma_names():
            synonymList1.append(synWords1)
    print synonymList1
print "list of synonyms : ", getSynonyms(data1)
and it works. Here's the result:
list of synonyms : [u'Federal', u'Fed', u'federal_official', u'Federal_Reserve_System', u'Federal_Reserve', u'Fed', u'FRS']
but when I use the list of strings words1, it doesn't work and the output is empty, like this: [].
Can anyone help? Thanks.
You need to pass the words individually and not after joining them.
from nltk.corpus import wordnet as wn

def getSynonyms(word1):
    synonymList1 = []
    for data1 in word1:
        wordnetSynset1 = wn.synsets(data1)
        tempList1 = []
        for synset1 in wordnetSynset1:
            for synWords1 in synset1.lemma_names():
                tempList1.append(synWords1)
        synonymList1.append(tempList1)
    return synonymList1

word1 = ['feds', 'move', 'to', 'require', 'cartocar', 'safety', 'communication']
print getSynonyms(word1)
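If you only want the distinct synonyms per word (note how 'Fed' appears twice in the result above), you can collect them in a set instead of a list. A small sketch; get_synonym_map is just a hypothetical helper name:

from nltk.corpus import wordnet as wn

def get_synonym_map(words):
    # map each word to the sorted set of its distinct lemma names
    synonym_map = {}
    for word in words:
        lemmas = set()
        for synset in wn.synsets(word):
            lemmas.update(synset.lemma_names())
        synonym_map[word] = sorted(lemmas)
    return synonym_map

word1 = ['feds', 'move', 'to', 'require', 'cartocar', 'safety', 'communication']
synonym_map = get_synonym_map(word1)
# synonym_map['feds'] now contains each distinct form only once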

I get this error: "expected string or buffer"

import nltk
from nltk.tokenize import PunktSentenceTokenizer, WordPunctTokenizer

file = open("C:\\Users\\file.txt")
text = file.read()

def ie_preprocess(text):
    sent_tokenizer = PunktSentenceTokenizer(text)
    sents = sent_tokenizer.tokenize(text)
    print(sents)
    word_tokenizer = WordPunctTokenizer()
    words = nltk.word_tokenize(sents)  # the error is raised here
    print(words)
    tagges = nltk.pos_tag(words)
    print(tagges)

ie_preprocess(text)
nltk.word_tokenize() takes in text which is expected to be a string, but you are passing in sents which is a list of sentences.
Instead, you want:
words = nltk.word_tokenize(text)
If you would like to tokenize each sentence into a list of words and get this back as a list of lists, you could use
words = [nltk.word_tokenize(sentence) for sentence in sents]
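Putting the fix together, a corrected version of ie_preprocess might look roughly like this (it keeps the per-sentence tokenization and tags each sentence separately; you may need the punkt and averaged_perceptron_tagger resources downloaded):

import nltk
from nltk.tokenize import PunktSentenceTokenizer

def ie_preprocess(text):
    # sentence-split, then tokenize and POS-tag each sentence separately
    sent_tokenizer = PunktSentenceTokenizer(text)
    sents = sent_tokenizer.tokenize(text)
    words = [nltk.word_tokenize(sentence) for sentence in sents]
    tagged = [nltk.pos_tag(sentence_words) for sentence_words in words]
    return tagged

with open("C:\\Users\\file.txt") as f:
    print(ie_preprocess(f.read()))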
