Downloading ML annotations in IBM-Watson Knowledge Studio - nlp

I am working on an NLP application with WKS and, after training, got rather low-performing results.
I wonder if there is a way to download the annotated documents with their entity classifications, for both the train and test sets, so that I can automatically identify in detail where the key differences are and fix them.
The documents that were annotated by humans can be downloaded from the section "Assets" / "Documents" -> Download Document Sets (button on the right side).
The following Python code lets you look at the data inside the downloaded file:
import json
import zipfile
import pandas as pd
# Replace <YOUR DOWNLOADED FILE> with the path to the downloaded ZIP archive.
with zipfile.ZipFile(<YOUR DOWNLOADED FILE>, "r") as archive:
    with archive.open('documents.json') as arch:
        data = arch.read()
documents = json.loads(data)
print(json.dumps(documents, indent=2, separators=(',', ':')))
# Build one row per document.
df_documentos = pd.DataFrame(None)
i = 0
for documento in documents:
    df_documentos.at[i, 'name'] = documento['name']
    df_documentos.at[i, 'text'] = documento['text']
    df_documentos.at[i, 'status'] = documento['status']
    df_documentos.at[i, 'id'] = documento['id']
    df_documentos.at[i, 'createdDate'] = '{:14.0f}'.format(documento['createdDate'])
    df_documentos.at[i, 'modifiedDate'] = '{:14.0f}'.format(documento['modifiedDate'])
    i += 1
df_documentos
with zipfile.ZipFile(<YOUR DOWNLOADED FILE>, "r") as archive:
    with archive.open('sets.json') as arch:
        data = arch.read()
sets = json.loads(data)
print(json.dumps(sets, indent=2, separators=(',', ':')))
# Build one row per document set (named doc_set to avoid shadowing the built-in set).
df_sets = pd.DataFrame(None)
i = 0
for doc_set in sets:
    df_sets.at[i, 'type'] = doc_set['type']
    df_sets.at[i, 'name'] = doc_set['name']
    df_sets.at[i, 'count'] = '{:6.0f}'.format(doc_set['count'])
    df_sets.at[i, 'id'] = doc_set['id']
    df_sets.at[i, 'createdDate'] = '{:14.0f}'.format(doc_set['createdDate'])
    df_sets.at[i, 'modifiedDate'] = '{:14.0f}'.format(doc_set['modifiedDate'])
    i += 1
df_sets
Then you can iterate over each of the JSON files in the "gt" folder of the compressed file to get the detailed sentence splitting, tokenization and annotation.
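A minimal sketch of that iteration, assuming only that the archive holds JSON files under a "gt/" folder as described above. The helper name load_ground_truth is hypothetical, and nothing is assumed about the internal structure of each file, so every file is simply parsed and returned as-is:

```python
import json
import zipfile


def load_ground_truth(zip_path):
    """Return {member_name: parsed_json} for every .json file under gt/."""
    annotations = {}
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            # Only the per-document files in the "gt" folder are of interest here.
            if name.startswith("gt/") and name.endswith(".json"):
                with archive.open(name) as member:
                    annotations[name] = json.loads(member.read())
    return annotations
```

Calling load_ground_truth with the path of the downloaded ZIP gives a dictionary that can be inspected file by file before deciding which fields to load into a DataFrame.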
What I need is to be able to download the annotations that the machine learning model produced for the TEST documents, which are visible in "Machine Learning Model" / "Performance" / "View Decoding Results".
With these I will be able to identify specific deviations that may lead me to revise the type dictionary and the annotation criteria.

I am sorry but this feature is not currently available.
You can submit a feature request at the following URL:
https://ibm-data-and-ai.ideas.aha.io/?project=WKS
Thank you.

Related

How to change only array for Dicom file with Simple ITK in python

I have a bunch of medical images in dicom that I want to correct for bias field inhomogeneity using SimpleITK in Python. The workflow is straightforward: I want to (1) open the dicom image, (2) create a binary mask of the object in the image, (3) apply N4 bias field correction to the masked image, (4) write back the corrected image in dicom format. Note that no spatial transformation is applied to the image, but only intensity transformation, so that I could copy all spatial information and all meta data (except for date/hour of creation and instance number) from the original to the corrected image.
I have written this function to achieve my goal:
import time
from pathlib import PurePath
from random import randint
import SimpleITK as sitk
def n4_dcm_correction(dcm_in_file):
    # Tags regenerated for the output: instance creation date, creation time, instance number
    metadata_to_set = ["0008|0012", "0008|0013", "0020|0013"]
    filepath = PurePath(dcm_in_file)
    root_dir = str(filepath.parent)
    file_name = filepath.stem
    # Read the image, including private tags
    dcm_reader = sitk.ImageFileReader()
    dcm_reader.SetFileName(dcm_in_file)
    dcm_reader.LoadPrivateTagsOn()
    inputImage = dcm_reader.Execute()
    metadata_to_copy = [k for k in inputImage.GetMetaDataKeys() if k not in metadata_to_set]
    # Build the object mask and run N4 bias field correction on the float image
    maskImage = sitk.OtsuThreshold(inputImage,0,1,200)
    filledImage = sitk.BinaryFillhole(maskImage)
    floatImage = sitk.Cast(inputImage,sitk.sitkFloat32)
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    output = corrector.Execute(floatImage, filledImage)
    # Copy spatial information and metadata from the original image
    output.CopyInformation(inputImage)
    for k in metadata_to_copy:
        print("key is: {}; value is {}".format(k, inputImage.GetMetaData(k)))
        output.SetMetaData(k, inputImage.GetMetaData(k))
    output.SetMetaData("0008|0012", time.strftime("%Y%m%d"))
    output.SetMetaData("0008|0013", time.strftime("%H%M%S"))
    output.SetMetaData("0008|0013", str(float(inputImage.GetMetaData("0008|0013")) + randint(1, 999)))
    # Write the corrected image back as DICOM
    out_file = "{}/{}_biascorrected.dcm".format(root_dir, file_name)
    writer = sitk.ImageFileWriter()
    writer.KeepOriginalImageUIDOn()
    writer.SetFileName(out_file)
    writer.Execute(sitk.Cast(output, sitk.sitkUInt16))
    return
n4_dcm_correction("/path/to/my/dcm/image.dcm")
The bias correction part works (the bias is removed), but the writing part is a mess. I would expect my output DICOM to have exactly the same metadata as the original one, but it is all missing, notably the patient name, the protocol name and the manufacturer. Similarly, something is very wrong with the spatial information: if I try to convert the DICOM to the NIfTI format with dcm2niix, the directions are reversed (superior is down and inferior is up, forward is back and backward is front). What step am I missing?
I suspect you are working with an MRI series, not a single file. Likely this example does what you want: read-modify-write a volume stored in a set of files.
If the example does not resolve your issue, please post to the ITK discourse, which is the primary location for ITK/SimpleITK related discussions.

How to use mozilla deepspeech to convert speech to text using its pre-trained model?

I want to convert speech to text using Mozilla DeepSpeech, but the output is really bad.
I have downloaded Mozilla's pre-trained model, and this is what I have done:
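from deepspeech import Model
import scipy.io.wavfile as wav  # assumed: wav.read(path) below matches scipy.io.wavfile
BEAM_WIDTH = 500
LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.10
N_FEATURES = 26
N_CONTEXT = 9
# model, alphabet and path are file paths defined elsewhere
ds = Model(model, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
fs,audio = wav.read(path)
data = audio[:,0] ## changing to mono channel (using only one channel)
prediction = ds.stt(data,fs)
print(test)
print(prediction)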
The output is nowhere near my audio sample. What do I have to do to increase its accuracy?
I assume it's because you are not including any language model.
The pre-trained model is basically just the acoustic model, which only transcribes the audio to similar-sounding text that may not make sense.
If you combine the acoustic model with a language model (LM), you will likely get better results.
In your code example I can see the parameter LM_WEIGHT but no reference to the LM itself.
I'm unsure in which language you want to integrate DeepSpeech, but here is the example for Node.js. This is the part where the LM is integrated:
const LM_ALPHA = 0.75;
const LM_BETA = 1.85;
let lmPath = './models/lm.binary';
let triePath = './models/trie';
model.enableDecoderWithLM(lmPath, triePath, LM_ALPHA, LM_BETA);
If I'm not mistaken, the LM and trie files are included in the pre-trained download archive:
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/deepspeech-0.5.1-models.tar.gz
Otherwise, you can also create your own language model, which makes sense if you only need the model to recognize specific words.
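To illustrate why combining the acoustic model with a language model helps, here is a toy sketch in plain Python. It is not DeepSpeech's decoder: the candidate transcripts and all log-probabilities are made up, and the score form (acoustic score plus LM_ALPHA times the LM score plus LM_BETA times the word count) is an assumption based on how LM-weighted beam search is commonly formulated.

```python
# Hypothetical candidate transcripts with made-up acoustic log-probabilities.
# Acoustically, the nonsensical candidate is slightly ahead.
ACOUSTIC_LOGPROB = {
    "wreck a nice beach": -4.0,
    "recognize speech": -4.2,
}

# Made-up language-model log-probabilities: the sensible phrase is far more likely.
LM_LOGPROB = {
    "wreck a nice beach": -9.0,
    "recognize speech": -3.0,
}

LM_ALPHA = 0.75  # weight of the language model score
LM_BETA = 1.85   # bonus per word


def combined_score(text):
    """Acoustic score plus weighted LM score plus a per-word bonus."""
    word_count = len(text.split())
    return ACOUSTIC_LOGPROB[text] + LM_ALPHA * LM_LOGPROB[text] + LM_BETA * word_count


# Best candidate using the acoustic score only, then using the combined score.
best_acoustic = max(ACOUSTIC_LOGPROB, key=ACOUSTIC_LOGPROB.get)
best_combined = max(ACOUSTIC_LOGPROB, key=combined_score)

print("acoustic only:", best_acoustic)
print("with LM:      ", best_combined)
```

With these made-up numbers the acoustic score alone prefers the similar-sounding nonsense, while the combined score prefers the sensible phrase, which is the effect the answer describes.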

How to extract the hidden layer features in H2ODeepLearningEstimator?

I found H2O has the function h2o.deepfeatures in R to pull the hidden layer features
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I didn't find any example in Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, make the model the object the method is called on (the "this"), and all other arguments shuffle along. That is, the example from the manual becomes (with dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name changes slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html

Doc2Vec.infer_vector keeps giving a different result every time on a particular trained model

I am trying to follow the official Doc2Vec Gensim tutorial mentioned here - https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
I modified the code in line 10 to determine the best-matching document for a given query, and every time I run it I get a completely different result set. My new code in line 10 of the notebook is:
inferred_vector = model.infer_vector(['only', 'you', 'can', 'prevent', 'forest', 'fires'])
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
rank = [docid for docid, sim in sims]
print(rank)
Every time I run this piece of code, I get a different set of documents matching the query "only you can prevent forest fires". The difference is stark and the results just do not seem to match.
Is Doc2Vec not a suitable match for querying and information extraction? Or are there bugs?
Looking at the code, infer_vector uses parts of the algorithm that are non-deterministic. Initialization of the word vectors is deterministic (see the code of seeded_vector below), but later steps, i.e. random sampling of words and negative sampling (updating only a sample of the word vectors per iteration), can cause non-deterministic output (thanks #gojomo).
def seeded_vector(self, seed_string):
    """Create one 'random' vector (but deterministic by seed_string)"""
    # Note: built-in hash() may vary by Python version or even (in Py3.x) per launch
    once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
    return (once.rand(self.vector_size) - 0.5) / self.vector_size
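The "deterministic by seed string" idea can be illustrated outside gensim with the standard library alone. This is only an analogue of seeded_vector, not gensim's implementation: zlib.crc32 stands in for the model's hashfxn and random.Random for NumPy's RandomState, both chosen here as assumptions for the sake of a self-contained example.

```python
import random
import zlib


def seeded_vector(seed_string, vector_size):
    """Build a 'random' vector that depends only on seed_string."""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    rng = random.Random(seed)
    return [(rng.random() - 0.5) / vector_size for _ in range(vector_size)]


first = seeded_vector("forest", 5)
second = seeded_vector("forest", 5)
other = seeded_vector("fires", 5)

print(first == second)  # same seed string, same vector
print(first == other)   # different seed string, different vector
```

Any step that draws from a generator that is not re-seeded per call (such as word subsampling or negative sampling) lacks this property, which is why repeated inference can differ.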
Set negative=0 to avoid randomization:
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [list('asdf'), list('asfasf')]
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(documents)]
model = Doc2Vec(documents, vector_size=20, window=5, min_count=1, negative=0, workers=6, epochs=10)
a = list('test sample')
b = list('testtesttest')
for s in (a, b):
    v1 = model.infer_vector(s)
    for i in range(100):
        v2 = model.infer_vector(s)
        assert np.all(v1 == v2), "Failed on %s" % (''.join(s))

Using SciKit's kMeans to Cluster One's Own Documents

The SciKit site offers this k-means demo, and I'd like to use as much of it as possible to cluster some of my own documents, since I'm new to both machine learning and SciKit. The problem is getting my documents in a form that fits their demonstration.
Here is the "problem area" from SciKit's example:
dataset = fetch_20newsgroups(subset='all', categories=categories, shuffle=True, random_state=42)
labels = dataset.target
true_k = np.unique(labels).shape[0]
As can be seen, in the example, the authors use/"fetch" a data set named "20newsgroups," the call for which (according to this page; see the second paragraph of 7.7) "returns a list of the raw text files that can be fed to text feature extractors." I am not relying on a list of "text files" -- as can be seen in my code below -- but I can place my "documents" in whatever form is necessary.
How can I use the SciKit example without having to place my "documents" in text files? Or is it standard practice only to cluster documents from text files rather than the database on which the documents live? It's simply not clear from the demo/documentation what in the example is completely superfluous, used because it made the authors' lives easier, and what isn't. Or at least it's not clear to me.
if cursor.rowcount > 0: #don't bother doing anything if we don't get anything from the database
    data = cursor.fetchall()
    for row in data:
        temp_string = row[0]+" "+row[1]+" "+row[3]+" "+row[4] # currently skipping the event_url: row[2]
        page = BeautifulSoup((''.join(temp_string)))
        pagetwo = str(page)
        clean_text = nltk.clean_html(pagetwo)
        tokens = nltk.word_tokenize(clean_text)
        fin_doc = "" + "\n"
        for word in tokens:
            fin_word = stemmer.stem(word).lower()
            if fin_word not in stopwords and len(fin_word) > 2:
                fin_doc += fin_word + " "
        documents.append(fin_doc)
The documents are just a list of strings, one string for each document, iirc.
The documentation is a bit unclear on this one. fetch_20newsgroups downloads the dataset as files, but the representation in the code is the content of the files, not the files themselves.
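As a minimal sketch of what that means in practice, a plain list of strings can be passed straight to a scikit-learn text vectorizer and then to KMeans, with no text files involved. The four documents below are hypothetical stand-ins for the strings built from the database rows, and the number of clusters is chosen by hand because, unlike the 20 newsgroups demo, there are no labels to derive true_k from.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the `documents` list built from the database rows.
documents = [
    "the cat sat on the mat with another cat",
    "cats and kittens chase the cat toy",
    "stock markets fell as investors sold shares",
    "investors bought shares as stock markets rose",
]

# Turn the list of strings into a sparse TF-IDF matrix (one row per document).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# Without labels, the number of clusters has to be chosen by hand.
true_k = 2
km = KMeans(n_clusters=true_k, n_init=10, random_state=42)
labels = km.fit_predict(X)

for label, doc in zip(labels, documents):
    print(label, doc)
```

The lines of the demo that use dataset.target and the evaluation metrics built on it only make sense when ground-truth labels exist, so they can be left out when clustering unlabeled documents.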
