I have a list of regex expressions that I want to find in certain docs.
x = ['\bin\sapp\sdata\b','\bin\sapp\sdata\b','\benough\sdata\b']
The patterns repeat themselves so I converted them to a set (see the first and second values in the list)
y = set(x)
When I try to find them in a specific doc, it doesn't find them, since it doesn't take them as their repr version:
import pandas as pd
import re
results = list()
doc = 'they wanted in app data and we did not provide it'
for value in y:
    results.append(re.findall(pattern=value, string=doc))
results = list(filter(None, results))
results
How do I overcome this?
Thanks
The problem was with the Python 3.7 version. The error I got was "bad escape \l at position 0". Once I changed re to regex it worked perfectly fine, even with the "messed up" coding.
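For reference, a minimal sketch of the same loop using the third-party regex module (installed with pip install regex); writing the patterns as raw strings also keeps escapes such as \b intact for the engine:
import regex  # third-party drop-in replacement for re
# raw strings so Python does not interpret the backslash escapes itself
patterns = {r'\bin\sapp\sdata\b', r'\benough\sdata\b'}
doc = 'they wanted in app data and we did not provide it'
results = [m for p in patterns for m in regex.findall(p, doc)]
print(results)
# ['in app data']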
Spacy-lookup is an entity matcher for very large dictionaries, which uses the FlashText module.
It seems that punctuation in the second case below prevents it from matching the entity.
Does someone know why this occurs and how it can be solved?
import spacy
from spacy_lookup import Entity
nlp = spacy.load("en_core_web_sm", disable = ['NER'])
entity = Entity(keywords_list=['vitamin D'])
nlp.add_pipe(entity, last=True)
#works for this sentence:
doc = nlp("vitamin D is contained in this.")
print([token.text for token in doc if token._.is_entity])
#['vitamin D']
#does not work for this sentence:
doc = nlp("This contains vitamin D.")
print([token.text for token in doc if token._.is_entity])
#[]
edit: interestingly, this does not occur when one directly uses the flashtext library (upon which spacy-lookup is based):
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
keyword_processor.add_keyword('vitamin D')
keywords_found = keyword_processor.extract_keywords("This contains vitamin D.", span_info=True)
print(keywords_found)
# [('vitamin D', 14, 23)]
edit: As Anwarvic pointed out, the problem comes from the way the default tokenizer splits the string.
edit: I am trying to find a general solution which does not, for example, involve adding spaces before every punctuation mark (basically, a solution that does not involve reformatting the input text).
The solution is pretty simple ... put a space after "D" like so:
>>> doc = nlp("This contains vitamin D .") #<-- space after D
>>> print([token.text for token in doc if token._.is_entity])
['vitamin D']
Why does it happen? Simply because spaCy considers "D." to be a single token, the same way the "D." in the name "D. Cooper" is treated as a single token!
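You can see this directly by printing the tokens (the exact split may vary slightly between spaCy versions and models):
>>> doc = nlp("This contains vitamin D.")
>>> print([token.text for token in doc])
['This', 'contains', 'vitamin', 'D.'] #<-- "D." stays one token, so the two-token entity 'vitamin D' is not matched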
I have found the following Python code that does this kind of replacement, but it only replaces the word with a manually selected synonym.
import nltk
from nltk.corpus import wordnet
synonyms = []
string="i love winter season"
for syn in wordnet.synsets("love"):
    for l in syn.lemmas():
        synonyms.append(l.name())
print(synonyms)
rep=synonyms[2]
st=string.replace("love",rep, 1)
print(st)
rep = synonyms[2] will always take the synonym at index 2.
What I want is to replace the selected word with a randomly selected synonym.
If I understand your question correctly, what you need is to select a random element from a list. This can be done in python like so:
import random
random.choice(synonyms)
As answered here
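Putting it together, a minimal sketch of your snippet with the random pick (assuming the WordNet data has already been downloaded with nltk.download('wordnet')):
import random
from nltk.corpus import wordnet

string = "i love winter season"
word = "love"
# gather every lemma name across all synsets of the word
synonyms = [lemma.name() for syn in wordnet.synsets(word) for lemma in syn.lemmas()]
if synonyms:
    rep = random.choice(synonyms)        # pick a synonym at random
    print(string.replace(word, rep, 1))  # replace only the first occurrence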
I am new to NLTK (http://www.nltk.org/), and Python for that matter. I wish to use the NLTK Python library, but use the BNC for the corpus. I do not believe this corpus is distributed through the NLTK Data download. Is there a way to import the BNC corpus to be used by NLTK? If so, how? I did find a function called BNCCorpusReader but have no idea how to use it. Also, at the BNC site, I was able to download the corpus (http://ota.ox.ac.uk/desc/2554).
http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word
Update
I have tried entrophy's suggestion, but get the following error:
raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'
My code to read in the corpora:
bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')
And my corpus is located in:
C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\
Regarding example usage of NLTK for collocation extraction, take a look at the following guide: the NLTK how-to guide on collocation extraction.
As far as BNC corpus reader is concerned, all the information was right there in the documentation.
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')
#And say you wanted to extract all bigram collocations and
#then later wanted to sort them just by their frequency, this is what you would do.
#Again, take a look at the link to the nltk guide on collocations for more examples.
list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
The output of that will look something like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'),0.003554139346773699),
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
And if you just want the bigrams themselves, in sorted (alphabetical) order, you could try something like this
sorted_bigrams = sorted(bigram for bigram, score in scored)
print(sorted_bigrams)
Resulting:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'),
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
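Note that score_ngrams already returns the pairs ordered from highest to lowest score, so if you want the bigrams ranked by frequency rather than alphabetically, you can simply keep that list, or ask the finder for the top n directly. A small sketch reusing the finder and scored from above:
# scored is already ordered from highest to lowest raw frequency
print(scored[:5])
# or ask the finder directly for the top-n bigrams by the same measure
print(finder.nbest(bigram_measures.raw_freq, 5))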
I know that nltk contains the VerbNet corpus, however, the Unified Verb Index combines information from it and 3 other useful sources. Is there any way to use this corpus in Python?
Through NLTK you can certainly access FrameNet, VerbNet and PropBank. I haven't done any work with the OntoNotes Sense Groupings.
Look at the code below for an idea of how to get information out of these three resources. Each call returns a list, so you can grab list elements individually and examine them in as much detail as you need.
from nltk.corpus import verbnet as vn
from nltk.corpus import framenet as fn
from nltk.corpus import propbank as pb
input = 'take'
vn_results = vn.classids(lemma=input)
# look up the lemma in VerbNet, FrameNet, and PropBank in turn
if not vn_results:
    print(input + ' not in verbnet.')
else:
    print('verbnet:')
    print(vn_results)

fn_results = fn.frames_by_lemma(input)
if not fn_results:
    print(input + ' not in framenet.')
else:
    print('framenet:')
    print(fn_results)

pb_results = []
try:
    pb_results = pb.rolesets(input)
except ValueError:
    print(input + ' not in propbank.')

if pb_results:
    print('propbank:')
    print(pb_results)
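For example, the PropBank rolesets come back as ElementTree elements, so you can drill into one like this (a sketch assuming 'take' has at least one roleset, following the NLTK PropBank how-to):
if pb_results:
    roleset = pb_results[0]
    print(roleset.attrib['id'])  # e.g. 'take.01'
    for role in roleset.findall('roles/role'):
        print(role.attrib['n'], role.attrib['descr'])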