Using British National Corpus in NLTK - python-3.x

I am new to NLTK (http://www.nltk.org/), and to Python for that matter. I wish to use the NLTK Python library, but with the BNC as the corpus. I do not believe this corpus is distributed through the NLTK Data download. Is there a way to import the BNC corpus so it can be used by NLTK? If so, how? I did find a function called BNCCorpusReader, but have no idea how to use it. Also, at the BNC site, I was able to download the corpus (http://ota.ox.ac.uk/desc/2554).
http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word
Update
I have tried entrophy's suggestion, but get the following error:
raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'
My code to read in the corpora:
bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')
And my corpus is located in:
C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\
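A minimal sanity check, assuming the directory layout above: the reader simply joins root and fileid, so both checks below should print True. If the second one prints False, the corpus probably unpacked into a different layout (for example with an extra Texts folder, as in the example root in the answer below), and root should point there instead.
import os
root = r"C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc"
# Both should print True if the reader's root and fileids pattern are correct.
print(os.path.isdir(root))
print(os.path.isfile(os.path.join(root, "A", "A0", "A00.xml")))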

Regarding example usage of NLTK for collocation extraction, take a look at the following guide: A how-to guide by NLTK on collocation extraction.
As far as the BNC corpus reader is concerned, all the information is right there in the documentation.
from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')
# Say you want to extract all bigram collocations and then sort them by frequency;
# this is what you would do. Again, see the NLTK collocations how-to linked above
# for more examples.
list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
The output of that will look something like this:
[(('of', 'the'), 0.004902261167963723), (('in', 'the'), 0.003554139346773699),
(('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894),
((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
The snippet below drops the scores and just sorts the bigrams themselves alphabetically (note that score_ngrams already returns the pairs ordered from highest to lowest score):
sorted_bigrams = sorted(bigram for bigram, score in scored)
print(sorted_bigrams)
Resulting in:
[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'),
('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'),
('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
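If you instead want the (bigram, score) pairs themselves ordered by score, a minimal sketch along the same lines:
# Sort the (bigram, score) pairs returned by score_ngrams above, highest score first.
by_score = sorted(scored, key=lambda pair: pair[1], reverse=True)
print(by_score[:10])  # the ten most frequent bigrams with their raw frequencies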

Related

Lemmatizing words like "movin", "roamin" or any sort of slang word

I'm currently testing both NLTK's and spaCy's lemmatizers (for spaCy, the large model for better accuracy). But they are simply not able to lemmatize slang words that are not in the dictionary properly.
import spacy
!python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])
doc = nlp("Hello I am movin to the target so i can roamin around")
for token in doc:
    print(token, token.lemma_)
Hello hello
I I
am be
movin movin
to to
the the
target target
so so
i I
can can
roamin roamin
around around```
This is my output, but what I would want is:
movin -> move
roamin -> roam
thank you!
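One simple workaround, sketched here under the assumption that the slang spellings are known in advance (this is not a built-in NLTK or spaCy feature): map them to standard forms before lemmatizing.
import spacy

# Hypothetical normalization table; in practice you would maintain one per language.
slang_map = {"movin": "moving", "roamin": "roaming"}

nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])
text = "Hello I am movin to the target so i can roamin around"

# Replace known slang tokens before running the pipeline.
normalized = " ".join(slang_map.get(w.lower(), w) for w in text.split())
for token in nlp(normalized):
    print(token, token.lemma_)  # the normalized "moving"/"roaming" should lemmatize to "move"/"roam"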

What is *.subwords file in natural language processing to use as vocabulary file?

I have been trying to create a vocab file for an NLP task, to use with trax's tokenize method, but I can't find which module/library to use to create the *.subwords file. Please help me out.
The easiest way to use trax.data.Tokenize with your own data and a subword vocabulary is to use Google's SentencePiece Python module:
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=data/my_data.csv --model_type=bpe --model_prefix=my_model --vocab_size=32000')
This creates two files:
my_model.model
my_model.vocab
We'll use this model in trax.data.Tokenize, adding the parameter vocab_type with the value "sentencepiece":
trax.data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model', vocab_type='sentencepiece')
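For context, a rough sketch of how this could plug into a trax input pipeline; the text generator below is a placeholder, and the exact keyword arguments may differ between trax versions:
import trax

# Placeholder stream of raw training texts; replace with your real data source.
def my_text_stream():
    yield "hello world"
    yield "another training sentence"

tokenize = trax.data.Tokenize(vocab_dir='vocab/',
                              vocab_file='my_model.model',
                              vocab_type='sentencepiece')
tokenized = tokenize(my_text_stream())
print(next(tokenized))  # expected: an array of subword ids for "hello world"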
I think this is the best approach, since you can load the model and use it to get the control ids instead of hard-coding them:
sp = spm.SentencePieceProcessor()
sp.load('my_model.model')
print('bos=sp.bos_id()=', sp.bos_id())
print('eos=sp.eos_id()=', sp.eos_id())
print('unk=sp.unk_id()=', sp.unk_id())
print('pad=sp.pad_id()=', sp.pad_id())
sentence = "hello world"
# encode: text => id
print("Pieces: ", sp.encode_as_pieces(sentence))
print("Ids: ", sp.encode_as_ids(sentence))
# decode: id => text
print("Decode Pieces: ", sp.decode_pieces(sp.encode_as_pieces(sentence)))
print("Decode ids: ", sp.decode_ids(sp.encode_as_ids(sentence)))
print([sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()])
If you still want to have a subword file, try this:
python trax/data/text_encoder_build_subword.py \
--corpus_filepattern=data/data.txt --corpus_max_lines=40000 \
--output_filename=data/my_file.subword
I hope this helps, since there is not much clear documentation out there on how to create compatible subword files.
You can use the TensorFlow Datasets API SubwordTextEncoder.
Use the following code snippet:
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (text_row for text_row in text_dataset), target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
TensorFlow will append the .subwords extension to the vocab file above.
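Once the file is written, the same vocabulary can be loaded back for encoding and decoding, roughly like this (vocab_fname is the prefix passed to save_to_file above):
import tensorflow_datasets as tfds

# Load the vocabulary saved above; tfds resolves the .subwords extension itself.
encoder = tfds.deprecated.text.SubwordTextEncoder.load_from_file(vocab_fname)
ids = encoder.encode("hello world")
print(ids)                  # list of subword ids
print(encoder.decode(ids))  # back to "hello world"
print(encoder.vocab_size)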

singularize noun phrases with spacy

I am looking for a way to singularize noun chunks with spacy
S='There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)
[chunk.text for chunk in doc.noun_chunks]
# = ['an example sentence', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
GOAL: Singularized: ['sentence', 'part', 'Processing']
Is there any obvious way? Is that always depending on the POS of every root word?
Thanks
note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach looks to me like it leads to many different methods, and of course different ones for every language. (I am working in EN, FR, DE.)
To get the base form of each word, you can use the .lemma_ property of the chunk or token.
I use Spacy version 2.x
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print (" ".join([token.lemma_ for token in doc]))
and the output:
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
Out[82]: ['sentence', 'part', 'processing']
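For completeness, a small self-contained version that combines the question's setup with this answer, which should reproduce the output shown above:
import spacy

nlp = spacy.load('en_core_web_sm')
S = ('There are multiple sentences that should include several parts and also '
     'make clear that studying Natural language Processing is not difficult ')
doc = nlp(S)

# The lemma of each noun chunk's root token is its base (singular) form.
print([chunk.root.lemma_ for chunk in doc.noun_chunks])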

spaCy issue with 'Vocab' or 'StringStore'

I am training the Rasa NLU using spaCy for the pipeline, but when I try to train it I get this error from spaCy:
KeyError: "[E018] Can't retrieve string for hash '18446744072967274715'. This usually refers to an issue with the `Vocab` or `StringStore`."
I have python 3.7.3, spaCy version is 2.2.3, rasa version 1.6.1
Does someone know how to fix this issue?
That sounds like a naming mistake. I guess you applied a matcher built for one text to a different one, so the match_id hash no longer resolves and spaCy gets confused.
To solve it, make sure you use the same matcher on the same text, like below:
Perform the standard imports, reset nlp, and import the PhraseMatcher library:
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)
dd = 'refers to the economic policies associated with supply-side economics, voodoo economics'
doc3 = nlp(dd) # convert string to spacy.tokens.doc.Doc
First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'free-market economics']
Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]
Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)
Build a list of matches:
matches = matcher(doc3)
matches #(match_id, start, end)
Viewing Matches:
for match_id, start, end in matches:  # the matcher has to be the same one we built for this text
    string_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, string_id, start, end, span.text)

How to check for unreadable OCRed text with NLTK

I am using NLTK to analyze a corpus that has been OCRed. I'm new to NLTK. Most of the OCR is good -- but sometimes I come across lines that are plainly junk. For instance: oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5
I want to identify (and filter out) such lines from my analysis.
How do NLP practitioners handle this situation? Something like: if 70% of the words in the sentence are not in WordNet, discard. Or if NLTK can't identify the part of speech for 80% of the words, then discard? What algorithms work for this? Is there a "gold standard" way to do this?
Using n-grams is probably your best option. You can use google n-grams, or you can use n-grams built into nltk. The idea is to create a language model and see what probability any given sentence gets. You can define a probability threshold, and all sentences with scores below it are removed. Any reasonable language model will give a very low score for the example sentence.
If you think that some words may be only slightly corrupted, you may try spelling correction before testing with the n-grams.
EDIT: here is some sample nltk code for doing this:
import math
from nltk import NgramModel
from nltk.corpus import brown
from nltk.util import ngrams
from nltk.probability import LidstoneProbDist
n = 2
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(n, brown.words(categories='news'), estimator=est)
def sentenceprob(sentence):
    # Sum the bigram log-probabilities of the sentence under the language model.
    bigrams = ngrams(sentence.split(), n)
    tot = 0
    for grams in bigrams:
        score = lm.logprob(grams[-1], grams[:-1])
        tot += score
    return tot

sentence1 = "This is a standard English sentence"
sentence2 = "oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"
print(sentenceprob(sentence1))
print(sentenceprob(sentence2))
The results look like:
>>> python lmtest.py
42.7436688972
158.850086668
Lower is better. (Of course, you can play with the parameters).
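Note that NgramModel was removed in NLTK 3; on current NLTK versions a comparable bigram model can be built with the nltk.lm package. A rough equivalent sketch (training data and smoothing chosen arbitrarily; here a higher total log-score means a more plausible sentence):
import nltk
from nltk.corpus import brown
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 2
train_sents = [[w.lower() for w in sent] for sent in brown.sents(categories='news')]
train_ngrams, vocab = padded_everygram_pipeline(n, train_sents)

lm = Laplace(n)             # add-one smoothing instead of Lidstone
lm.fit(train_ngrams, vocab)

def sentence_logprob(sentence):
    words = sentence.lower().split()
    # Sum of log2 P(w2 | w1) over the sentence's bigrams.
    return sum(lm.logscore(w2, [w1]) for w1, w2 in nltk.bigrams(words))

print(sentence_logprob("This is a standard English sentence"))
print(sentence_logprob("oomfi ow Ba wmnondmam BE wBwHo<oBoBm. Bowman as: Ham: 8 ooww om $5"))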
