I am trying to do custom NER using spaCy for articles, but when I start to train the model I get the error:
"[E088] Text of length 1021312 exceeds maximum of 1000000...."
I tried the following solutions:
i. Increasing nlp.max_length = 1500000
ii. Using spaCy's "en_core_web_lg" and disabling the irrelevant pipeline components
iii. Trying nlp.max_length = len(txt) + 5000
iv. Changing max_length in the config file
v. nlp.max_length = len(text)
vi. Splitting the data on '\n' and rejoining with spaces (the newline character '\n' is what creates a new line):
doc = nlp.make_doc(" ".join(text.split('\n')))
But all in vain.
I used the code below, and it worked for me:
import spacy
nlp = spacy.load('en_core_web_lg',disable=['parser', 'tagger','ner'])
nlp.max_length = 1198623
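Attempt (vi) in the question hints at the other common workaround: instead of raising nlp.max_length, split the text into chunks that each stay under the limit and process them one at a time. A minimal pure-Python sketch (the 900_000 budget and newline-aligned splitting are my own assumptions, not from the thread):

```python
def chunk_text(text, max_len=900_000):
    """Split text into newline-aligned chunks, each below max_len characters.

    Note: a single line longer than max_len still becomes one oversized chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.split('\n'):
        if current and size + len(line) + 1 > max_len:
            chunks.append('\n'.join(current))
            current, size = [], 0
        current.append(line)
        size += len(line) + 1
    if current:
        chunks.append('\n'.join(current))
    return chunks

# Then run each chunk through the pipeline separately, e.g.:
# for doc in nlp.pipe(chunk_text(raw_text)):
#     ...

print(chunk_text("a" * 10 + "\n" + "b" * 10, max_len=12))  # ['aaaaaaaaaa', 'bbbbbbbbbb']
```

Entities that span a chunk boundary will be missed, so pick split points (paragraph breaks, documents) where that is unlikely.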
I'm currently testing both nltk's and spaCy's lemmatizers (checked for accuracy and on large datasets). But they are simply not able to lemmatize slang words properly.
import spacy
!python3 -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg", disable=['parser', 'ner'])
doc = nlp("Hello I am movin to the target so i can roamin around")
for token in doc:
    print(token, token.lemma_)
Hello hello
I I
am be
movin movin
to to
the the
target target
so so
i I
can can
roamin roamin
around around
This is my output, but I would want:
movin -> move
roamin -> roam
thank you!
I have been trying to create a vocab file for an NLP task, to use in trax's tokenize method, but I can't find which module/library to use to create the *.subwords file. Please help me out?
The easiest way to use trax.data.Tokenize with your own data and a subword vocabulary is to use Google's SentencePiece Python module:
import sentencepiece as spm
spm.SentencePieceTrainer.train('--input=data/my_data.csv --model_type=bpe --model_prefix=my_model --vocab_size=32000')
This creates two files:
my_model.model
my_model.vocab
We'll use this model in trax.data.Tokenize, adding the parameter vocab_type with the value "sentencepiece":
trax.data.Tokenize(vocab_dir='vocab/', vocab_file='my_model.model', vocab_type='sentencepiece')
I think it's the best way, since you can load the model and use it to get the control ids while avoiding hardcoding them:
sp = spm.SentencePieceProcessor()
sp.load('my_model.model')
print('bos=sp.bos_id()=', sp.bos_id())
print('eos=sp.eos_id()=', sp.eos_id())
print('unk=sp.unk_id()=', sp.unk_id())
print('pad=sp.pad_id()=', sp.pad_id())
sentence = "hello world"
# encode: text => id
print("Pieces: ", sp.encode_as_pieces(sentence))
print("Ids: ", sp.encode_as_ids(sentence))
# decode: id => text
print("Decode Pieces: ", sp.decode_pieces(sp.encode_as_pieces(sentence)))
print("Decode ids: ", sp.decode_ids(sp.encode_as_ids(sentence)))
print([sp.bos_id()] + sp.encode_as_ids(sentence) + [sp.eos_id()])
If you still want a subword file, try this:
python trax/data/text_encoder_build_subword.py \
--corpus_filepattern=data/data.txt --corpus_max_lines=40000 \
--output_filename=data/my_file.subword
I hope this helps, since there is no clear literature out there on how to create compatible subword files.
You can use the TensorFlow API SubwordTextEncoder.
Use the following code snippet:
encoder = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (text_row for text_row in text_dataset), target_vocab_size=2**15)
encoder.save_to_file(vocab_fname)
TensorFlow will append the .subwords extension to the vocab file above.
I am following the code from here.
I have a CSV file of 8000 questions and answers, and I have built an LSI model with 1000 topics from a tf-idf corpus using gensim, as follows. I only consider the questions as part of the text, not the answers.
texts = [jieba.lcut(text) for text in document]
# tk = WhitespaceTokenizer()
# texts = [tk.tokenize(text) for text in document]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
tfidf_corpus = tfidf[corpus]
lsi_model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=1000)
corpus_lsi = lsi_model[tfidf_corpus]
index = similarities.SparseMatrixSimilarity(corpus_lsi, num_features = feature_cnt)
Before this, I also preprocess the data by removing stopwords with nltk, replacing punctuation with regex, and lemmatizing with WordNet and nltk.
I understand that jieba is not a tokenizer suited for English, because it tokenizes spaces as well, like this:
Sample: This is untokenized text
Tokenized: 'This',' ','is',' ','untokenized', ' ', 'text'
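Those extra ' ' tokens really do enter the dictionary and the bag-of-words counts. A plain-Python sketch (not gensim) of what doc2bow would see:

```python
from collections import Counter

# Token lists as produced by the two tokenizers on "This is untokenized text"
jieba_style = ['This', ' ', 'is', ' ', 'untokenized', ' ', 'text']
whitespace_style = ['This', 'is', 'untokenized', 'text']

print(Counter(jieba_style))       # the ' ' token is the most frequent "word"
print(Counter(whitespace_style))
```

Since the ' ' token appears in essentially every document, tf-idf should weight it close to zero, which is what makes the observed accuracy drop surprising rather than obviously explained by the space tokens.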
When I switch from jieba to the nltk whitespace tokenizer, a strange thing happens: my accuracy suddenly drops. That is, when I query a new sentence using the following code, I get worse results:
keyword = "New sentence the similarity of which is to be found to the main corpus"
kw_vector = dictionary.doc2bow(jieba.lcut(keyword)) # jieba.lcut can be replaced by tk.tokenize()
sim = index[lsi_model[tfidf[kw_vector]]]
x = [sim[i] for i in np.argsort(sim)[-2:]]
My understanding is that extra and useless words and characters like whitespaces should decrease accuracy but here I observe an opposite effect. What could be the possible reasons?
One possible explanation I came up with is that most of the questions are short, only 5 to 6 words, like:
What is the office address?
Who to contact for X?
Where to find document Y?
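With questions that short, similarity scores are indeed very sensitive: a single token of overlap or difference moves the score by a large step. A plain bag-of-words cosine sketch (not the tf-idf/LSI pipeline above, just to illustrate the scale):

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two sparse term-count dicts
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

q1 = Counter("what is the office address".split())
q2 = Counter("where is the office located".split())
q3 = Counter("where is the main office located".split())

print(round(cosine(q1, q2), 2))  # 0.6  (shares 'is', 'the', 'office')
print(round(cosine(q1, q3), 2))  # 0.55 (one extra word lowers the score)
```

So with 5-6 word questions, any change in tokenization shifts a handful of the few available dimensions, which can plausibly swing rankings a lot.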
I have a few patients' medical record text files which I got from the internet, and I want to identify the files with bad quality (misspelled words, special characters between words, erroneous words) and the files with good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus of medical records for identifying misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First, tokenize your text (I recommend scispacy for medical text).
Identify possible "bad quality" words by counting each unique word across your corpus, e.g. taking all words that occur <= 3 times.
Add the words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms. Otherwise use a medical dictionary, e.g. UMLS or https://github.com/glutanimate/wordlist-medicalterms-en, to add the medical words not in a regular dictionary.
Use pyspellchecker to identify the misspellings via the Levenshtein distance algorithm, comparing against our dictionary.
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker

nlp = spacy.load('en_core_sci_md')  # sciSpaCy

# count word frequencies across the whole corpus
word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())

infreq_words = [word for word in word_freq.keys() if word_freq[word] <= 3 and not word[0].isdigit()]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]

# write the frequent words out as our custom dictionary
add_to_dictionary = " ".join(freq_words)
f = open("medical_dict.txt", "w+")
f.write(add_to_dictionary)
f.close()

spell = SpellChecker()
spell.distance = 1  # set the distance parameter to just 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')

misspelled = spell.unknown(infreq_words)

misspell_dict = {}
for word in misspelled:
    if word != spell.correction(word):
        misspell_dict[word] = spell.correction(word)
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
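As a sketch of that regex clean-up step (the patterns below are hypothetical examples of systematic fixes, not part of the answer; adapt them to your corpus):

```python
import re

def clean_token(text):
    # Hypothetical systematic fixes - tune the patterns to your own data.
    text = re.sub(r"[^\w\s'.,-]", "", text)    # drop stray special characters
    text = re.sub(r"(.)\1{2,}", r"\1\1", text) # collapse runs of 3+ repeated characters
    text = re.sub(r"\s{2,}", " ", text)        # normalise repeated whitespace
    return text.strip()

print(clean_token("pat#ient  was   seeen tooodayy!!"))  # -> 'patient was seen toodayy'
```

Note the second pattern only shortens runs of three or more, so legitimate double letters ("seen", "office") survive.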
You can use BioBERT to do a contextual spelling check.
Link: https://github.com/dmis-lab/biobert
I'm trying to do data augmentation on an FAQ dataset. I replace words, specifically nouns, with their most similar words from WordNet, checking the similarity with spaCy. I use multiple nested for loops to go through my dataset.
import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')
questions = pd.read_csv("FAQ.csv")
list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))

for question in list_questions:
    for token in question:
        threshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > threshold:
                        good_word = similar_word
                        threshold = token.similarity(similar_word)
However, the following warning is printed several times and I don't understand why :
UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
It is my similar_word.similarity(token) call which creates the problem, but I don't understand why.
The form of my list_questions is :
list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]
I need to check the token but also the similar_word in the loop; for example, I still get the warning here:
tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')

if(similar_word):
    for token in tokens:
        if (token):
            print(token.text, similar_word.similarity(token))
You get that warning message when similar_word is not a valid spacy document. E.g. this is a minimal reproducible example:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
#similar_word = nlp(u'rabbit')
similar_word = nlp(u'')
for token in tokens:
    print(token.text, similar_word.similarity(token))
If you change the '' to be 'rabbit' it works fine. (Cats are apparently just a fraction more similar to rabbits than dogs are!)
(UPDATE: As you point out, unknown words also trigger the warning; they will be valid spacy objects, but not have any word vector.)
So, one fix would be to check similar_word is valid, including having a valid word vector, before calling similarity():
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
similar_word = nlp(u'')
if(similar_word and similar_word.vector_norm):
    for token in tokens:
        if(token and token.vector_norm):
            print(token.text, similar_word.similarity(token))
Alternative Approach:
You could suppress the particular warning. It is W008. I believe setting an environmental variable SPACY_WARNING_IGNORE=W008 before running your script would do it. (Not tested.)
(See source code)
By the way, similarity() might cause some CPU load, so is worth storing in a variable, instead of calculating it three times as you currently do. (Some people might argue that is premature optimization, but I think it might also make the code more readable.)
I suppressed the W008 warning by setting the environment variable with this code in the run file:
import os
from flask import Flask

app = Flask(__name__)
app.config['SPACY_WARNING_IGNORE'] = "W008"
os.environ["SPACY_WARNING_IGNORE"] = "W008"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)
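If your spaCy version no longer honours SPACY_WARNING_IGNORE (newer releases rely on Python's standard warnings machinery, as far as I know), a stdlib-only sketch that filters the W008 message by its prefix:

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore any warning whose message starts with the spaCy code [W008]
    warnings.filterwarnings("ignore", message=r"\[W008\]")
    warnings.warn("[W008] Evaluating Doc.similarity based on empty vectors.")
    warnings.warn("an unrelated warning")

print(len(caught))  # 1 -- only the unrelated warning got through
```

In a real application you would call warnings.filterwarnings(...) once at startup instead of inside a catch_warnings block; the block here just makes the effect observable.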