I figured the answer to my title is often "go read the documentation", but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python.
I have a bunch of .txt files, and I want to be able to use the corpus functions that NLTK provides for its built-in corpora in nltk_data.
I've tried PlaintextCorpusReader, but I couldn't get further than:
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()
How do I segment the newcorpus sentences using punkt? I tried the punkt functions, but they couldn't take a PlaintextCorpusReader object as input.
Can you also point me to how I can write the segmented data out to text files?
After some years of figuring out how it works, here's an updated tutorial on how to create an NLTK corpus from a directory of text files.
The main idea is to make use of the nltk.corpus.reader package. If you have a directory of text files in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
    file1.txt
    file2.txt
    ...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: the PlaintextCorpusReader uses default tokenizers (a punkt sentence tokenizer and a word tokenizer) to split your texts into sentences and words. These defaults are built for English, so they may NOT work for all languages.
Here's the full code: it creates test text files, builds a corpus with NLTK, and shows how to access the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different text files.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1, txt2]

# Make a new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        fout.write(text)

# Check that our corpus directory exists and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    with open(corpusdir + infile, 'r') as fin:
        assert fin.read().strip() == text.strip()
# Create a new corpus by specifying two parameters:
# (1) the directory of the new corpus,
# (2) the fileids of the corpus.
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)                        # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Open the file.
        print(fin.read().strip())        # Print the content of the file.
print()

# Access the plaintext; outputs a plain string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically applies the reader's sentence and word tokenizers.
#
# Each element in the outermost list is a paragraph,
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# Access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# Access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just the tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# Access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))
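To address the last part of the question (writing the segmented data back out to text files), here is a minimal sketch that writes each file's sentences, one sentence per line, into a separate output directory; the directory name newcorpus_segmented/ and the one-sentence-per-line format are just illustrative choices:
import os

outdir = 'newcorpus_segmented/'  # hypothetical output directory
if not os.path.isdir(outdir):
    os.mkdir(outdir)

for fileid in newcorpus.fileids():
    with open(os.path.join(outdir, fileid), 'w') as fout:
        # newcorpus.sents(fileid) is a list of token lists; join the tokens
        # with spaces and write one sentence per line.
        for sent in newcorpus.sents(fileid):
            fout.write(' '.join(sent) + '\n')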
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first ensure that you have Python-callable word tokenization and sentence tokenization functions that take a string as input and produce output like this:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
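As a minimal sketch of how such tokenizers can then be plugged into the reader (the whitespace word tokenizer and the German punkt model below are only illustrative assumptions; any object with a tokenize() method works, and the punkt models require nltk.download('punkt')):
import nltk.data
from nltk.tokenize import RegexpTokenizer
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Illustrative tokenizers: split words on whitespace, split sentences
# with the punkt model for German.
my_word_tokenizer = RegexpTokenizer(r'\S+')
my_sent_tokenizer = nltk.data.LazyLoader('tokenizers/punkt/german.pickle')

newcorpus = PlaintextCorpusReader('newcorpus/', r'.*\.txt',
                                  word_tokenizer=my_word_tokenizer,
                                  sent_tokenizer=my_sent_tokenizer)

print(newcorpus.sents())  # sentences are now split by the tokenizers above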
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
For a single string, a tokenizer would be used as follows (explained here, see section 5 for punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
"""
if the ./ dir contains the file my_corpus.txt, then you
can view say all the words it by doing this
"""
>>> newcorpus.words('my_corpus.txt')
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

filecontent1 = "This is a cow"
filecontent2 = "This is a Dog"

corpusdir = 'nltk_data/'
with open(corpusdir + 'content1.txt', 'w') as text_file:
    text_file.write(filecontent1)
with open(corpusdir + 'content2.txt', 'w') as text_file:
    text_file.write(filecontent2)

text_corpus = PlaintextCorpusReader(corpusdir, ["content1.txt", "content2.txt"])
no_of_words_corpus1 = len(text_corpus.words("content1.txt"))
print(no_of_words_corpus1)
no_of_unique_words_corpus1 = len(set(text_corpus.words("content1.txt")))
no_of_words_corpus2 = len(text_corpus.words("content2.txt"))
no_of_unique_words_corpus2 = len(set(text_corpus.words("content2.txt")))
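For completeness, a quick check of the other quantities computed above, plus the sentence view that the same reader gives you (nothing here beyond the reader API already shown):
print(no_of_unique_words_corpus1)
print(no_of_words_corpus2, no_of_unique_words_corpus2)
print(text_corpus.sents("content1.txt"))  # e.g. [['This', 'is', 'a', 'cow']]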
In the code below I want to build a bigram language model with Good-Turing discounting. The training files are the first 150 files of the WSJ treebank, while the test files are the remaining 49.
Problem:
After building the model and loading the test data, I must check the test tokens and replace those not included in the vocabulary with <UNK>. However, I do not know how to access the learned model's vocabulary. Could you help with the code here?
Any assistance is much appreciated. Thank you in advance.
from nltk.corpus import treebank
from nltk.util import pad_sequence
from nltk.util import bigrams, trigrams
from nltk.lm import Laplace
from nltk.probability import SimpleGoodTuringProbDist
from nltk import FreqDist
from nltk.lm.preprocessing import padded_everygram_pipeline

# training data
train_treebank = []
for j in range(150):
    for i in treebank.sents(treebank.fileids()[j]):
        train_treebank.append(i)

# training bigrams
train_bigrams = []
for sent in train_treebank:
    train_bigrams.append(list(bigrams(pad_sequence(sent,
                                                   pad_left=True, left_pad_symbol="<START>",
                                                   pad_right=True, right_pad_symbol="<END>",
                                                   n=2))))
train_bigrams_onelist = [item for sublist in train_bigrams for item in sublist]

# learn Good-Turing language model
freq_dist_bigrams = FreqDist(train_bigrams_onelist)
model = SimpleGoodTuringProbDist(freq_dist_bigrams)

# test data
test_treebank = []
for j in range(150, 199):  # len(treebank.fileids()) = 199
    for i in treebank.sents(treebank.fileids()[j]):
        test_treebank.append(i)

# replace test tokens not included in vocabulary by <UNK>
# how to do it?
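One possible way to do the <UNK> replacement is sketched below, assuming the vocabulary is taken to be the set of word types seen in the training sentences (the SimpleGoodTuringProbDist only stores bigram counts, so the word vocabulary has to be kept separately):
# Vocabulary = all word types seen in the training sentences,
# plus the padding symbols used when building the bigrams.
vocab = set(word for sent in train_treebank for word in sent)
vocab.update(["<START>", "<END>"])

# Replace any test token that is not in the vocabulary with <UNK>.
test_treebank_unk = [[word if word in vocab else "<UNK>" for word in sent]
                     for sent in test_treebank]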
First time using word2vec, and the file I am working with is in XML format. I want to iterate through the patents to find each Title, then apply word2vec to see if there are similar words (to indicate similar titles).
So far I have parsed the XML file using ElementTree to retrieve each title, then I have applied the sentence tokenizer followed by the tweet tokenizer to return a list of sentences where each word has been tokenized (not sure if this was the best method). I then put the tokenized sentences into my word2vec model and tested with one word to see if it returned a vector. This seems to only work for a word in the first sentence, so I'm not sure it is recognising all the sentences.
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text
sentence = Patent_Title
stopWords = set(stopwords.words('english'))
tokens = nltk.sent_tokenize(sentence)
print(tokens)
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
#print(tokens_sentences)
model = gensim.models.Word2Vec(tokens_sentences, min_count=1,size=32)
words = list(model.wv.vocab)
print(words)
print(model['Solar'])
I would expect it to identify the word 'solar' in a sentence and print out the vector, so that I could then look for similar words. Instead I am receiving the error:
word 'Solar' not in vocabulary
Just handle the error as an exception on its first occurrence in the loop:
# print(model['Solar'])
try:
    print(model['Solar'])
except Exception as e:
    pass
Working code:
import numpy as np
import pandas as pd
import gensim
import nltk
import xml.etree.ElementTree as ET
from gensim.models.word2vec import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer, sent_tokenize
tree = ET.parse('6785.xml')
root = tree.getroot()
for child in root.iter("Title"):
    Patent_Title = child.text
sentence = Patent_Title
stopWords = set(stopwords.words('english'))
tokens = nltk.sent_tokenize(sentence)
print(tokens)
tokenizer_words = TweetTokenizer()
tokens_sentences = [tokenizer_words.tokenize(t) for t in tokens]
#print(tokens_sentences)
model = gensim.models.Word2Vec(tokens_sentences, min_count=1,size=32)
words = list(model.wv.vocab)
print(words)
try:
    print(model['Solar'])
except Exception as e:
    pass
It is simply because Solar is not in your corpus.
Word2Vec tries to generate word vectors for each word in your tokens_sentences. If the training corpus didn't include the word/token that you try to look up, word2vec will not have a word vector for that word, and that is why you get the error.
Advice: make your text data case-insensitive. That is, lowercase all the text (uppercase works too, but it is not the convention).
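A rough sketch of that advice applied to the code above: lowercase the tokens before training, then look up the lowercased word (this uses the gensim 3.x API from the question; newer gensim versions rename size to vector_size and wv.vocab to wv.key_to_index):
# Lowercase every token so lookups become case-insensitive.
tokens_sentences_lower = [[w.lower() for w in sent] for sent in tokens_sentences]

model = gensim.models.Word2Vec(tokens_sentences_lower, min_count=1, size=32)

if 'solar' in model.wv.vocab:   # look up the lowercased form
    print(model.wv['solar'])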
I want to create a word cloud with the variable 'word' that will show me a cloud of all words tagged 'NN' or 'NNP'.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
# Using TextBlob
for word, noun in blob.tags:
    if noun in ['NN', 'NNP']:
        print(f'{word} ==> {noun}')
Where do I add the following code?
# Create and generate a word cloud image:
wordcloud = WordCloud().generate(word)
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
First you need to shortlist the words and collect them in a list, something like below:
words = []
for word, noun in blob.tags:
    if noun in ['NN', 'NNP']:
        print(f'{word} ==> {noun}')
        words.append(word)
Then you can feed the above word list into the word cloud generator as below; optionally, you can pass a list of stopwords:
wordcloud = WordCloud(stopwords=STOPWORDS).generate(' '.join(words))
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
This question already has an answer here: Why is my NLTK function slow when processing the DataFrame?
I want to remove stop words and punctuation in Amazon_baby.csv.
import pandas as pd

data = pd.read_csv('amazon_baby.csv')
data.fillna(value='', inplace=True)
data.head()

import string
from nltk.corpus import stopwords

def text_process(msg):
    no_punc = [char for char in msg if char not in string.punctuation]
    no_punc = ''.join(no_punc)
    return [word for word in no_punc.split() if word.lower() not in stopwords.words('english')]

data['review'].apply(text_process)
This code executes on up to 10k rows, but if I apply it to the entire dataset the kernel always shows as busy and the cell never finishes executing.
Please help with this.
Find the data set here.
You are processing the data character by character, which is extremely slow.
Because of the large size of the data (~183531 rows), and because each row has to be processed individually, the complexity blows up towards O(n²).
I have implemented a slightly different approach using word_tokenize below:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def remove_punction_and_stopwords(msg):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(msg)
    filtered_words = [w for w in word_tokens
                      if w not in stop_words and w not in string.punctuation]
    new_sentence = ' '.join(filtered_words)
    return new_sentence
I tried running it for 6 mins and it processed 136322 rows. I'm sure if I had run it for 10 mins it would have completed execution successfully.
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def text_clean(msg):
    tokens = word_tokenize(msg)
    tokens = [w.lower() for w in tokens]
    stop_words = set(stopwords.words('english'))
    no_punc_and_stop_words = [w for w in tokens
                              if w not in string.punctuation and w not in stop_words]
    return no_punc_and_stop_words
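Either version can then be applied to the review column from the question; a quick usage sketch (the new column name review_clean is just an assumed name for illustration):
data['review'] = data['review'].fillna('')
data['review_clean'] = data['review'].apply(text_clean)
print(data['review_clean'].head())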
I have parsed 30 Excel files and created a pandas dataframe. I have tokenized the words, taken out stop words, and made bigrams. However, when I try to lemmatize, it gives me this error: TypeError: unhashable type: 'list'
Here's my code:
# Use simple preprocessing to clean up data and tokenize.
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stopwords.
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

# Define function for bigrams.
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words.
data_words_nostops = remove_stopwords(data_words)

# Form bigrams.
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing.
from nltk.stem.wordnet import WordNetLemmatizer
def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words.
data_lemmatized = get_lemma(data_words_bigrams)
This is exactly where I get the error. How should I adjust my code to resolve this issue? Thank you in advance.
As suggested, here are the first few lines of the dataframe from df.head():
(dataframe snapshot image)
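Since data_words_bigrams is a list of token lists, one way to avoid the unhashable-type error is to lemmatize token by token instead of passing the whole nested list to lemmatize(). A minimal sketch of that idea, reusing a single WordNetLemmatizer instance:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def get_lemma(word):
    return lemmatizer.lemmatize(word)

# Apply the lemmatizer to each token of each document rather than
# to the nested list all at once.
data_lemmatized = [[get_lemma(word) for word in doc] for doc in data_words_bigrams]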