I used the commands provided in the spaCy documentation and followed the steps below:
1. Created the training data in the spaCy format:
TRAIN_DATA = [("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}), ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]})]
2. Converted the train and dev data to .spacy files using the code below:
```python
import os
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")  # load other spacy model
db = DocBin()  # create a DocBin object
for text, annot in tqdm(TRAIN_DATA):  # data in previous format
    doc = nlp.make_doc(text)  # create doc object from text
    ents = []
    for start, end, label in annot["entities"]:  # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents  # label the text with the ents
    db.add(doc)
db.to_disk("./train.spacy")  # save the docbin object
```
I converted the dev data to dev.spacy in the same way.
3. Filled in the base spaCy configuration file to produce config.cfg:
```python -m spacy init fill-config base_config.cfg config.cfg```
4. Trained the model:
```python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy```
5. Got the output below: [![spacy training output][1]][1]
Please let me know if there is anything I am doing wrong here. Thanks in advance.
[1]: https://i.stack.imgur.com/FfrBX.png
It looks like your data is NER annotations, but your pipeline contains only a tok2vec and parser component. It should contain an NER component. Use the quickstart to generate an NER config and start over from step 3 in your list.
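One way to confirm this before retraining is to look at the `pipeline` setting in the `[nlp]` section of config.cfg. The following is only a sketch: it assumes the usual spaCy v3 config layout, the sample config text is made up, and `pipeline_components` is a hypothetical helper, not part of spaCy.

```python
import configparser
import json


def pipeline_components(config_text):
    """Return the component names listed under [nlp] pipeline (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The value is written as a JSON-style list, e.g. ["tok2vec","parser"].
    return json.loads(parser["nlp"]["pipeline"])


# Made-up excerpt of a config that lacks an NER component.
sample_config = """
[nlp]
lang = "en"
pipeline = ["tok2vec","parser"]
"""

components = pipeline_components(sample_config)
has_ner = "ner" in components
print(components, "NER present:", has_ner)
```

If "ner" is missing from that list, the entity annotations in train.spacy have no component to learn from.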
Following the documentation of gensim.models.ldamodel, I want to train an LDA model and create a word cloud from it (following this SO answer). I am using the following code, combined from both sources:
from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
import gensim
import matplotlib.pyplot as plt
from wordcloud import WordCloud
common_dictionary = Dictionary(common_texts) # create corpus
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]
lda = gensim.models.LdaModel(common_corpus, num_topics=10) # train model on corpus
for t in range(lda.num_topics):
    plt.figure()
    plt.imshow(WordCloud().fit_words(lda.show_topic(t, 200)))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()
However, I get an AttributeError: 'list' object has no attribute 'items' on the line plt.imshow(...)
Can someone help me out here? (Answers to similar questions have not been working for me and I am trying to compile a minimal pipeline with this.)
From the docs, the method WordCloud.fit_words() expects a dictionary as input.
Your error shows that it's looking for 'items', which is a method of dictionaries, but it finds a list object instead.
So the problem is: lda.show_topic(t, 200) returns a list of (word, probability) pairs instead of a dictionary. Use dict() to convert it!
Finally:
plt.imshow(WordCloud().fit_words(dict(lda.show_topic(t, 200))))
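To see why the conversion works without training a model, here is a small sketch with made-up (word, probability) pairs standing in for the output of lda.show_topic(); the words and numbers are purely illustrative.

```python
# Made-up stand-in for lda.show_topic(t, 200): a list of (word, probability) pairs.
topic_terms = [("pizza", 0.4), ("pasta", 0.35), ("salad", 0.25)]

# A list has no .items() method, which is what triggers the AttributeError.
print(hasattr(topic_terms, "items"))

# dict() turns the pairs into a word -> weight mapping, which does have .items().
frequencies = dict(topic_terms)
print(hasattr(frequencies, "items"))
print(frequencies["pizza"])
```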
import time

# cv2.cvtColor takes a numpy ndarray as an argument
import numpy as nm
import pytesseract
# importing OpenCV
import cv2
from PIL import ImageGrab, Image

bboxes = [(1469, 1014, 1495, 1029)]

def imToString():
    # Path of tesseract executable (raw string so the backslashes are kept literally)
    pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
    while True:
        for box in bboxes:
            # ImageGrab: capture the screen image in a loop.
            # bbox restricts the capture to a specific area.
            cap = ImageGrab.grab(bbox=box)
            # Convert the image to monochrome so it is easier for the
            # OCR to read, and obtain the output string.
            tesstr = pytesseract.image_to_string(
                cv2.cvtColor(nm.array(cap), cv2.COLOR_BGR2GRAY), lang='eng', config='digits')
            cap.show()
            time.sleep(5)
            print(tesstr)

# Calling the function
imToString()
It captures an image like this:
It isn't always two digits; it can be one or three digits too.
Pytesseract returns values like 'asi' and 'oli'.
So, which image-to-text (OCR) algorithm should I use for this problem, and how do I use it? I need a very precise value; in this example it's 53, so the output should be around 50.
I have parsed 30 Excel files and created a pandas dataframe. I have tokenized the words, taken out stop words and made bigrams. However, when I try to lemmatize, I get this error: TypeError: unhashable type: 'list'
Here's my code:
# Use simple_preprocess to clean up data and tokenize
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))

# Define function for removing stopwords
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# Define function for bigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

# Remove stop words
data_words_nostops = remove_stopwords(data_words)

# Form bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Define function for lemmatizing
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

# Lemmatize words
data_lemmatized = get_lemma(data_words_bigrams)
This is exactly where I get the error. How should I adjust my code to resolve this issue? Thank you in advance.
As suggested, here are the first few lines of the dataframe:
df.head()
(dataframe snapshot image)
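The error is consistent with get_lemma() receiving the whole nested list, while a lemmatizer looks up one word string at a time. The sketch below reproduces the same kind of failure with a toy dictionary lookup standing in for WordNetLemmatizer (the table and sample tokens are made up), then applies the function per word; it is an illustration under those assumptions, not a drop-in fix.

```python
# Toy stand-in for WordNetLemmatizer().lemmatize (hypothetical lookup table).
TOY_LEMMAS = {"cars": "car", "running": "run"}


def get_lemma(word):
    # A dict lookup needs a hashable key, so a single string works but a list does not.
    return TOY_LEMMAS.get(word, word)


# Made-up data shaped like data_words_bigrams: one list of tokens per document.
data_words_bigrams = [["cars", "running"], ["fast_food", "cars"]]

try:
    get_lemma(data_words_bigrams)  # passing the list of lists, as in the question
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Apply the function to each word of each document instead.
data_lemmatized = [[get_lemma(word) for word in doc] for doc in data_words_bigrams]
print(data_lemmatized)
```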
I realize that the answer to a question like the one in my title is often to go and read the documentation, but I went through the NLTK book and it doesn't give the answer. I'm fairly new to Python.
I have a bunch of .txt files and I want to be able to use the corpus functions that NLTK provides for the corpus nltk_data.
I've tried PlaintextCorpusReader but I couldn't get further than:
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
>>> newcorpus.words()
How do I segment the sentences in newcorpus using punkt? I tried the punkt functions, but they couldn't read the PlaintextCorpusReader class.
Can you also show me how to write the segmented data to text files?
After some years of figuring out how it works, here's the updated tutorial on how to create an NLTK corpus from a directory of text files.
The main idea is to make use of the nltk.corpus.reader package. In the case that you have a directory of textfiles in English, it's best to use the PlaintextCorpusReader.
If you have a directory that looks like this:
newcorpus/
    file1.txt
    file2.txt
    ...
Simply use these lines of code and you can get a corpus:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpusdir = 'newcorpus/' # Directory of corpus.
newcorpus = PlaintextCorpusReader(corpusdir, '.*')
NOTE: By default, PlaintextCorpusReader splits your texts into words with WordPunctTokenizer and into sentences with the English Punkt model. These defaults are built for English and may NOT work for all languages.
Here's the full code, which creates test text files, builds a corpus with NLTK, and accesses the corpus at different levels:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different text files.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1, txt2]

# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename += 1
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        print(text, file=fout)

# Check that our corpus exists and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    assert open(corpusdir + infile, 'r').read().strip() == text.strip()

# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile)  # The fileid of each file.
    with newcorpus.open(infile) as fin:  # Opens the file.
        print(fin.read().strip())  # Prints the content of the file
print()

# Access the plaintext; outputs a plain string.
print(newcorpus.raw().strip())
print()

# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically applies the reader's sentence and word tokenizers.
#
# Each element in the outermost list is a paragraph, and
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())
print()

# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

# Access sentences in the corpus. (list of list of strings)
# NOTE: the texts are flattened into sentences that contain tokens.
print(newcorpus.sents())
print()

# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

# Access just tokens/words in the corpus. (list of strings)
print(newcorpus.words())

# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))
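The question also asks how to write the segmented data to text files. A minimal sketch, assuming `sents` is a list of token lists shaped like the output of newcorpus.sents(); the sample sentences and the output file name are made up, and joining tokens with spaces is just one possible output format.

```python
import os
import tempfile

# Made-up stand-in for list(newcorpus.sents()): each sentence is a list of tokens.
sents = [
    ["This", "is", "a", "foo", "bar", "sentence", "."],
    ["Are", "you", "a", "foo", "bar", "?"],
]

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.mkdtemp(), "segmented.txt")

# Write one sentence per line, tokens separated by single spaces.
with open(out_path, "w", encoding="utf8") as fout:
    for sent in sents:
        fout.write(" ".join(sent) + "\n")

# Read the file back to confirm what was written.
with open(out_path, encoding="utf8") as fin:
    lines = fin.read().splitlines()
print(lines)
```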
Finally, to read a directory of texts and create an NLTK corpus in another language, you must first make sure that you have Python-callable word tokenization and sentence tokenization modules that take a string as input and produce output like this:
>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
>>> sent_tokenize(txt1)
['This is a foo bar sentence.', 'And this is the first txtfile in the corpus.']
>>> word_tokenize(sent_tokenize(txt1)[0])
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']
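For illustration, here is a sketch of what such tokenizers might look like for a language where splitting on punctuation and whitespace is good enough. The class names and regular expressions are hypothetical and deliberately simplistic; each class exposes a tokenize() method that takes a string and returns a list of strings, which is the interface NLTK tokenizer objects provide.

```python
import re


class SimpleSentenceTokenizer:
    """Hypothetical sentence splitter: breaks after ., ! or ? followed by whitespace."""

    def tokenize(self, text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class SimpleWordTokenizer:
    """Hypothetical word splitter: runs of word characters or single punctuation marks."""

    def tokenize(self, text):
        return re.findall(r"\w+|[^\w\s]", text)


txt1 = "This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."
sentences = SimpleSentenceTokenizer().tokenize(txt1)
words = SimpleWordTokenizer().tokenize(sentences[0])
print(sentences)
print(words)
```

Objects like these could then be handed to PlaintextCorpusReader through its word_tokenizer and sent_tokenizer keyword arguments.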
I think the PlaintextCorpusReader already segments the input with a punkt tokenizer, at least if your input language is English.
PlaintextCorpusReader's constructor:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader(
                 'tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
You can pass the reader a word and sentence tokenizer, but for the latter the default already is nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
For a single string, a tokenizer would be used as follows (explained here, see section 5 for punkt tokenizer).
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = './'
>>> newcorpus = PlaintextCorpusReader(corpus_root, '.*')
"""
If the ./ directory contains the file my_corpus.txt, you
can view all the words in it by doing this:
"""
>>> newcorpus.words('my_corpus.txt')
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
filecontent1 = "This is a cow"
filecontent2 = "This is a Dog"
corpusdir = 'nltk_data/'
with open(corpusdir + 'content1.txt', 'w') as text_file:
    text_file.write(filecontent1)
with open(corpusdir + 'content2.txt', 'w') as text_file:
    text_file.write(filecontent2)
text_corpus = PlaintextCorpusReader(corpusdir, ["content1.txt", "content2.txt"])
no_of_words_corpus1 = len(text_corpus.words("content1.txt"))
print(no_of_words_corpus1)
no_of_unique_words_corpus1 = len(set(text_corpus.words("content1.txt")))
no_of_words_corpus2 = len(text_corpus.words("content2.txt"))
no_of_unique_words_corpus2 = len(set(text_corpus.words("content2.txt")))