Using SciKit's kMeans to Cluster One's Own Documents - scikit-learn

The SciKit site offers this k-means demo, and I'd like to use as much of it as possible to cluster some of my own documents, since I'm new to both machine learning and SciKit. The problem is getting my documents in a form that fits their demonstration.
Here is the "problem area" from SciKit's example:
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
As can be seen, in the example the authors fetch a dataset named "20newsgroups"; according to this page (see the second paragraph of 7.7), that call "returns a list of the raw text files that can be fed to text feature extractors." I am not working from a list of "text files" -- as can be seen in my code below -- but I can put my "documents" into whatever form is necessary.
How can I use the SciKit example without having to place my "documents" in text files? Or is it standard practice only to cluster documents from text files rather than from the database where the documents live? It's simply not clear to me from the demo/documentation which parts of the example are superfluous -- there only because they made the authors' lives easier -- and which aren't.
if cursor.rowcount > 0: # don't bother doing anything if we don't get anything from the database
    data = cursor.fetchall()
    for row in data:
        temp_string = row[0]+" "+row[1]+" "+row[3]+" "+row[4] # currently skipping the event_url: row[2]
        page = BeautifulSoup(''.join(temp_string))
        pagetwo = str(page)
        clean_text = nltk.clean_html(pagetwo)
        tokens = nltk.word_tokenize(clean_text)
        fin_doc = "" + "\n"
        for word in tokens:
            fin_word = stemmer.stem(word).lower()
            if fin_word not in stopwords and len(fin_word) > 2:
                fin_doc += fin_word + " "
        documents.append(fin_doc)

The documents are just a list of strings, one string for each document, if I recall correctly.
The documentation is a bit unclear on this one: fetch_20newsgroups downloads the dataset as files, but what the code actually works with is the content of those files, not the files themselves.
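To make that concrete, here is a minimal sketch (not taken from the demo; the vectorizer and KMeans settings are just illustrative assumptions) of feeding a plain list of strings, like the documents list built above, into the same kind of pipeline:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# stand-ins for the strings your database loop appends to `documents`
documents = [
    "stemmed tokens from the first event page",
    "stemmed tokens from the second event page",
    "stemmed tokens from the third event page",
    "stemmed tokens from the fourth event page",
]

# turn the raw strings into a tf-idf matrix, as the demo does with dataset.data
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# without labels there is no true_k to derive, so pick the number of clusters yourself
true_k = 2
km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=10)
km.fit(X)
print(km.labels_)
In other words, nothing in the demo requires text files on disk; fetch_20newsgroups is only there to supply such a list of strings (plus labels, which your own documents won't have).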

Related

Whitelist tokens for text generation (XLNet, GPT-2) in huggingface-transformers

In the documentation on text generation (https://huggingface.co/transformers/main_classes/model.html#generative-models) there is the option to put
bad_words_ids (List[int], optional) – List of token ids that are not allowed to be generated. In order to get the tokens of the words that should not appear in the generated text, use tokenizer.encode(bad_word, add_prefix_space=True).
Is there also the option to put something along the lines of "allowed_words_ids"? The idea would be to restrict the language of the generated texts.
I'd also suggest doing what Sahar Mills said. You can do it in the following way.
Get the whole vocab of the model you are using, e.g.:
from transformers import AutoTokenizer
# Load tokenizer
checkpoint = "CenIA/distillbert-base-spanish-uncased" #Example model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
vocab = tokenizer.get_vocab()
list(vocab.keys())[:100] # to see the first 100 words
Define the words you do want the model to be able to generate.
words_to_keep = ['forzado', 'vendieron', 'verticales'] # or load them from somewhere else
Define a function to create the bad_words_ids, that is, the whole model vocab minus the words you want in the model:
def create_bad_words_ids(vocab, words_to_keep):
    # every vocabulary entry that is not explicitly allowed becomes a "bad" word
    bad_words = [word for word in vocab.keys() if word not in words_to_keep]
    # encode each remaining word into its token id(s)
    return [tokenizer.encode(word, add_special_tokens=False) for word in bad_words]

bad_words_ids = create_bad_words_ids(vocab=vocab, words_to_keep=words_to_keep)
print(bad_words_ids)
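For completeness, a rough sketch of how the resulting list would then be passed to generate. This assumes a generative checkpoint such as gpt2 (the DistilBERT checkpoint above would only supply a tokenizer, not a text-generation model), so treat the names here as placeholders:
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_checkpoint = "gpt2"  # hypothetical generative model; use whichever model you actually generate with
gen_tokenizer = AutoTokenizer.from_pretrained(gen_checkpoint)
gen_model = AutoModelForCausalLM.from_pretrained(gen_checkpoint)

# bad_words_ids must be built from this model's own vocab, exactly as shown above
input_ids = gen_tokenizer("Once upon a time", return_tensors="pt").input_ids
output = gen_model.generate(input_ids, bad_words_ids=bad_words_ids, max_length=30)
print(gen_tokenizer.decode(output[0], skip_special_tokens=True))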
Hope it helps,
cheers

Downloading ML annotations in IBM-Watson Knowledge Studio

I am working on an NLP application with WKS, and after training I got rather low-performing results.
I wonder if there is a way to download the annotated documents with their entity classifications, for both the train and test sets, so I can automatically identify in detail where the key differences are and fix them.
The documents that were annotated by humans can be downloaded in the section "Assets" / "Documents" -> Download Document Sets (button on the right side).
The following Python code lets you look at the data inside the download:
import json
import zipfile
import pandas as pd

with zipfile.ZipFile(<YOUR DOWNLOADED FILE>, "r") as zip_file:
    with zip_file.open('documents.json') as arch:
        data = arch.read()

documents = json.loads(data)
print(json.dumps(documents, indent=2, separators=(',', ':')))

df_documentos = pd.DataFrame(None)
i = 0
for documento in documents:
    df_documentos.at[i, 'name'] = documento['name']
    df_documentos.at[i, 'text'] = documento['text']
    df_documentos.at[i, 'status'] = documento['status']
    df_documentos.at[i, 'id'] = documento['id']
    df_documentos.at[i, 'createdDate'] = '{:14.0f}'.format(documento['createdDate'])
    df_documentos.at[i, 'modifiedDate'] = '{:14.0f}'.format(documento['modifiedDate'])
    i += 1
df_documentos
with zipfile.ZipFile(<YOUR DOWNLOADED FILE>, "r") as zip_file:
    with zip_file.open('sets.json') as arch:
        data = arch.read()

sets = json.loads(data)
print(json.dumps(sets, indent=2, separators=(',', ':')))

df_sets = pd.DataFrame(None)
i = 0
for document_set in sets:
    df_sets.at[i, 'type'] = document_set['type']
    df_sets.at[i, 'name'] = document_set['name']
    df_sets.at[i, 'count'] = '{:6.0f}'.format(document_set['count'])
    df_sets.at[i, 'id'] = document_set['id']
    df_sets.at[i, 'createdDate'] = '{:14.0f}'.format(document_set['createdDate'])
    df_sets.at[i, 'modifiedDate'] = '{:14.0f}'.format(document_set['modifiedDate'])
    i += 1
df_sets
Then you can iterate over each of the JSON files in the "gt" folder of the compressed file to get the detailed sentence splitting, tokenization and annotation, as sketched below.
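A rough sketch of that iteration, assuming the annotation files live under a gt/ folder inside the same zip (the exact folder name and JSON structure may vary between WKS versions):
import json
import zipfile

with zipfile.ZipFile(<YOUR DOWNLOADED FILE>, "r") as zip_file:
    # list only the per-document annotation files inside the gt folder
    gt_names = [name for name in zip_file.namelist()
                if name.startswith('gt/') and name.endswith('.json')]
    for name in gt_names:
        with zip_file.open(name) as arch:
            annotation = json.loads(arch.read())
        # each file carries the sentence splitting, tokenization and annotations for one document
        print(name, list(annotation.keys()))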
What I need is to be able to download the annotations that the machine learning model produced for the TEST documents, which are visible in "Machine Learning Model" / "Performance" / "View Decoding Results".
With this I would be able to identify the specific deviations that should lead me to revise the type dictionary and annotation criteria.
I am sorry but this feature is not currently available.
You can submit a feature request at the following URL:
https://ibm-data-and-ai.ideas.aha.io/?project=WKS
Thank you.

NLP : Error/Unknown/misspelled text Detection model of a patient's medical text file

I have a few patient medical record text files which I got from the internet, and I want to identify which files are of bad quality (misspelled words, special characters between words, erroneous words) and which files are of good quality (clean text). I want to build an error detection model using text mining/NLP.
1) Can someone please help me with the approach and solution for feature extraction and model selection?
2) Is there any medical corpus for medical records to identify the misspelled/erroneous words?
If your goal is to simply correct these misspelled words to improve performance on whatever downstream task you want to do, then I can suggest a simple approach which has worked sufficiently well for me.
First, tokenize your text (I recommend scispacy for medical text).
Identify possible "bad quality" words simply by counting each unique word across your whole corpus, e.g. flag all words that occur <= 3 times.
Add the words that occur > 3 times in your corpus (we assume these are all correctly spelled) to a regular English dictionary. If your corpus is large, this is perfectly adequate for capturing medical terms; otherwise use a medical dictionary, e.g. UMLS or https://github.com/glutanimate/wordlist-medicalterms-en, to add the medical words not in a regular dictionary.
Use pyspellchecker to identify the misspellings, using the Levenshtein distance algorithm to compare against that dictionary.
Replace the typos with what pyspellchecker thinks they should be.
A basic example:
import spacy
import scispacy
from collections import Counter
from spellchecker import SpellChecker

nlp = spacy.load('en_core_sci_md') # sciSpaCy model

word_freq = Counter()
for doc in corpus:
    tokens = nlp.tokenizer(doc)
    tokenised_text = ""
    for token in tokens:
        tokenised_text = tokenised_text + token.text + " "
    word_freq.update(tokenised_text.split())

infreq_words = [word for word in word_freq.keys() if word_freq[word] <= 3 and word[0].isdigit() == False]
freq_words = [word for word in word_freq.keys() if word_freq[word] > 3]

add_to_dictionary = " ".join(freq_words)
f = open("medical_dict.txt", "w+")
f.write(add_to_dictionary)
f.close()

spell = SpellChecker()
spell.distance = 1 # set the distance parameter to just 1 edit away - much quicker
spell.word_frequency.load_text_file('medical_dict.txt')
misspelled = spell.unknown(infreq_words)

misspell_dict = {}
for i, word in enumerate(misspelled):
    if word != spell.correction(word):
        misspell_dict[word] = spell.correction(word)
print(list(misspell_dict.items())[:10])
I would also recommend using regular expressions to fix any other "bad quality" words which can be systematically corrected.
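For example, a couple of illustrative passes of that kind (the patterns here are only assumptions about what the noise looks like; tune them to the artefacts you actually see in your files):
import re

def basic_cleanup(text):
    # drop stray special characters wedged inside words, e.g. "pat#ient" -> "patient"
    text = re.sub(r'(?<=[a-zA-Z])[#@*^~|](?=[a-zA-Z])', '', text)
    # collapse the runs of whitespace left behind
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

print(basic_cleanup("The pat#ient was  seen   to*day"))  # -> "The patient was seen today"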
You can also use BioBERT to do contextual spell checking.
Link: https://github.com/dmis-lab/biobert

Gensim-python: Is there a simple way to get the number of documents in which a given token appears?

My gensim model is like this:
from gensim import corpora

class MyCorpus(object):
    parametersList = []

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def __iter__(self):
        # for line in open('mycorpus.txt'):
        for line in texts:
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line[0].lower().split())

if __name__ == "__main__":
    texts = [['human human interface computer'],
             ['survey user user computer system system system response time'],
             ['eps user interface system'],
             ['system human system eps'],
             ['user response time'],
             ['trees'],
             ['graph trees'],
             ['graph minors trees'],
             ['graph minors minors survey survey survey']]
    dictionary = corpora.Dictionary(line[0].lower().split() for line in texts)
    corpus = MyCorpus(dictionary)
The frequency of each token in each document is automatically evaluated.
I can also define the tf-idf model and access the tf-idf statistic for each token in each document:
from gensim.models import TfidfModel
model = TfidfModel(corpus)
However, I have no clue how to count (in a memory-friendly way) the number of documents in which a given word appears. How can I do that? [Sure... I could work it out from the tf-idf values and the document frequencies... but I would like to get it directly from some counting process.]
For instance, for the first document, I would like to get something like
[('human',2), ('interface',2), ('computer',2)]
since each of the tokens above appears in two of the documents.
For the second:
[('survey',2), ('user',3), ('computer',2),('system',3), ('response',2),('time',2)]
How about this?
from collections import Counter

documents = [...] # one string per document
# count each token at most once per document, then sum across documents to get document frequencies
count_dicts = [Counter(set(document.lower().split())) for document in documents]
total = sum(count_dicts, Counter())
I assumed that each of your strings is a separate document/file; adjust as needed if not. I also fixed up the code a bit.
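Since the question is specifically about gensim, it may also be worth noting that corpora.Dictionary already tracks document frequencies in its dfs attribute (token id -> number of documents containing the token), so with the dictionary and texts from the question in scope, something like this avoids any extra counting pass:
# document frequency straight from the dictionary built in the question
doc_freq = {dictionary[token_id]: n_docs for token_id, n_docs in dictionary.dfs.items()}

# the tokens of the first document with their document frequencies
first_doc_tokens = set(texts[0][0].lower().split())
print(sorted((token, doc_freq[token]) for token in first_doc_tokens))
# -> [('computer', 2), ('human', 2), ('interface', 2)]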

Doc2Vec.infer_vector keeps giving different result everytime on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The differences are stark and the results just do not seem consistent.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking into the code: in infer_vector you are using parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic - see the code of seeded_vector - but when we look further, i.e., random sampling of words and negative sampling (updating only a sample of word vectors per iteration), these can cause non-deterministic output (thanks @gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)

a = list('test sample')
b = list('testtesttest')

for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))
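As a hedged follow-up (not part of the original answer): if you keep negative sampling enabled, you can reduce, though not eliminate, the run-to-run jitter by giving infer_vector more passes over the text, and then compare the resulting vectors by cosine similarity rather than exact equality. A minimal sketch, assuming gensim 4.x where the parameter is called epochs (older releases call it steps):
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(list('asdf'), [0]), TaggedDocument(list('asfasf'), [1])]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, workers=1, epochs=10)

tokens = list('test sample')
# more inference epochs -> more stable vectors between runs
v1 = model.infer_vector(tokens, epochs=100)
v2 = model.infer_vector(tokens, epochs=100)

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cosine)  # typically close to 1.0, though rarely exactly equal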
