Watson Speech to text - custom language model training error - Total number of OOV words exceeds 30000 - speech-to-text

I am getting training error for my custom language model.
I added < 30K words. I verified the count of OOV words -
{"corpora": [{
"out_of_vocabulary_words": 73,
"total_words": 5034,
"name": "ABC",
"status": "analyzed"
}]}
curl -X GET -u "myusername":"mypassword" "https://stream.watsonplatform.net/speech-to-text/api/v1/customizations/my-customization-id/words" > OOV.txt
I have 21,861 words in OOV.txt.
Now, when I train the language model, I get the following error -
{
"code": 400,
"code_description": "Bad Request",
"error": "Total number of OOV words 128858 exceeds 30000"
}
Please help.

We fixed this issue with Speech-to-Text Customization in Watson long ago.

Related

Natural Nodejs NLP classifier gives heap dump when training more than 2000 data points

We are using natural nodejs library to classify our users queries in the following way:
const natural = require("natural");
const cls = new natural.LogisticRegressionClassifier();
cls.addDocument("book", "stationary");
cls.addDocument("pen", "stationary");
...
.....
....
5000 + data points
cls.addDocument("last one", "last one");
cls.train(); ----- But This gives heap error and the program crashes.
pino.info("StockNames training successfully completed.");
The train() functions works fine when dataset size is under few hundreds but throwing heap errors and crashing when dataset size is in few thousands. Any suggestions please help. Thanks

MLflow webserver returns 400 status, "Incompatible input types for column X. Can not safely convert float64 to <U0."

I am implementing an anomaly detection web service using MLflow and sklearn.pipeline.Pipeline(). The aim of the model is to detect web crawlers using server log and response_length column is one of my features. After serving model, for testing the web service I send below request that contains the 20 first columns of the train data.
$ curl --location --request POST '127.0.0.1:8000/invocations'
--header 'Content-Type: text/csv' \
--data-binary 'datasets/test.csv'
But response of the web server has status code 400 (BAD REQUEST) and this JSON body:
{
"error_code": "BAD_REQUEST",
"message": "Incompatible input types for column response_length. Can not safely convert float64 to <U0."
}
Here is the model compilation MLflow Tracking component log:
[Pipeline] ......... (step 1 of 3) Processing transform, total=11.8min
[Pipeline] ............... (step 2 of 3) Processing pca, total= 4.8s
[Pipeline] ........ (step 3 of 3) Processing rule_based, total= 0.0s
2021/07/16 04:55:12 WARNING mlflow.sklearn: Training metrics will not be recorded because training labels were not specified. To automatically record training metrics, provide training labels as inputs to the model training function.
2021/07/16 04:55:12 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: "/home/matin/workspace/Rahnema College/venv/lib/python3.8/site-packages/mlflow/models/signature.py:129: UserWarning: Hint: Inferred schema contains integer column(s). Integer columns in Python cannot represent missing values. If your input data contains missing values at inference time, it will be encoded as floats and will cause a schema enforcement error. The best way to avoid this problem is to infer the model schema based on a realistic data sample (training dataset) that includes missing values. Alternatively, you can declare integer columns as doubles (float64) whenever these columns may have missing values. See `Handling Integers With Missing Values <https://www.mlflow.org/docs/latest/models.html#handling-integers-with-missing-values>`_ for more details."
Logged data and model in run: 8843336f5c31482c9e246669944b1370
---------- logged params ----------
{'memory': 'None',
'pca': 'PCAEstimator()',
'rule_based': 'RuleBasedEstimator()',
'steps': "[('transform', <log_transformer.LogTransformer object at "
"0x7f05a8b95760>), ('pca', PCAEstimator()), ('rule_based', "
'RuleBasedEstimator())]',
'transform': '<log_transformer.LogTransformer object at 0x7f05a8b95760>',
'verbose': 'True'}
---------- logged metrics ----------
{}
---------- logged tags ----------
{'estimator_class': 'sklearn.pipeline.Pipeline', 'estimator_name': 'Pipeline'}
---------- logged artifacts ----------
['model/MLmodel',
'model/conda.yaml',
'model/model.pkl',
'model/requirements.txt']
Could anyone tell me exactly how I can fix this model serve problem?
The problem caused by mlflow.utils.autologging_utils WARNING.
When the model is created, data input signature is saved on the MLmodel file with some.
You should change response_length signature input type from string to double by replacing
{"name": "response_length", "type": "double"}
instead of
{"name": "response_length", "type": "string"}
so it doesn't need to be converted. After serving the model with edited MLmodel file, the web server worked as expected.

Change trialId in Google AI Platform hyperparameter tuning

I'm trying to follow this tutorial on hyperparameter tuning on AI Platform: https://cloud.google.com/blog/products/gcp/hyperparameter-tuning-on-google-cloud-platform-is-now-faster-and-smarter.
My configuration yaml file looks like this:
trainingInput:
hyperparameters:
goal: MINIMIZE
hyperparameterMetricTag: loss
maxTrials: 4
maxParallelTrials: 2
params:
- parameterName: learning_rate
type: DISCRETE
discreteValues:
- 0.0005
- 0.001
- 0.0015
- 0.002
The expected output:
"completedTrialCount": "4",
"trials": [
{
"trialId": "3",
"hyperparameters": {
"learning_rate": "2e-03"
},
"finalMetric": {
"trainingStep": "123456",
"objectiveValue": 0.123456
},
},
Is there any way to customize the trialId instead the defaults numeric values (e.g. 1,2,3,4...)?
It is not possible to customize the trialId as it is dependent on the parameter maxTrials in your hyperparameter tuning config.
maxTrials only accepts integers, so the assigned value to trialId will be a range from 1 to your defined maxTrials.
Also as mentioned in the example in your post where maxTrials: 40 is set and it yields a json that shows trialId: 35 which is within the range of maxTrials.
This indicates that 40 trials have been completed, and the best so far
is trial 35, which achieved an objective of 1.079 with the
hyperparameter values of nembeds=18 and nnsize=32.
Example output:

Need to get actor name out of the Json file

i want to get actor name out of this json file page_title and then match this with database i tried using nltk and spacy but there i have to train data. Do i have train for each and ever sentence i have more than 100k sentences. If i sit to train data it will takes a month or more. Is there any way that i can dump K_actor database to train spacy, nltk or any other way.
{"page_title": "Sonakshi Sinha To Auction Sketch Of Buddha To Help Migrant Labourers", "description": "Sonakshi Sinha took to Instagram to share a timelapse video of a sketch of Buddha that she made to auction to raise funds for migrant workers affected by Covid-19 crisis. ", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589815261_1589815196489_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/sonakshi-sinha-to-auction-sketch-of-buddha-to-help-migrant-labourers-2626123.html"}
{"page_title": "Anushka Sharma Calls Virat Kohli 'A Liar' on IG Live, Nushrat Bharucha Gets Propositioned on Twitter", "description": "In an Instagram live interaction with Sunil Chhetri, Virat Kohli was left embarrassed after Anushka Sharma called him a 'jhootha' from behind the camera. This and more in today's wrap.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589813980_1589813933996_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/anushka-sharma-calls-virat-kohli-a-liar-on-ig-live-nushrat-bharucha-gets-propositioned-on-twitter-2626093.html"}
{"page_title": "Ranveer Singh Shares a Throwback to the Days When WWF was His Life", "description": "Ranveer Singh shared a throwback picture from his childhood where he could be seen posing in front of a poster of WWE legend Hulk Hogan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812401_screenshot_20200518-195906_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/ranveer-singh-shares-a-throwback-to-the-days-when-wwf-was-his-life-2626067.html"}
{"page_title": "Salman Khan's Love Song 'Tere Bina' Gets 26 Million Views", "description": "Salman Khan's song Tere Bina, which was launched a few days ago, had garnered 12 million views within 24 hours. As it continues to trend, it has garnered 26 million views in less than a week.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589099778_screenshot_20200510-135934_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/salman-khans-love-song-tere-bina-gets-26-million-views-2626077.html"}
{"page_title": "Yash And Radhika Pandit Pose With Their Kids For a Perfect Family Picture", "description": "Kannada actor Yash tied the knot with actress Radhika Pandit in 2016. The couple shares two kids together.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589812187_yash.jpg", "post_url": "https://www.news18.com/news/movies/yash-and-radhika-pandit-pose-with-their-kids-for-a-perfect-family-picture-2626055.html"}
{"page_title": "Malaika Arora Shares Beach Vacay Boomerang With Hopeful Note", "description": "Malaika Arora shared a throwback boomerang from a beach vacation where she could be seen playfully spinning. She also shared a hopeful message along with it.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589810291_screenshot_20200518-192603_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/malaika-arora-shares-beach-vacay-boomerang-with-hopeful-note-2626019.html"}
{"page_title": "Actor Nawazuddin Siddiqui's Wife Aaliya Sends Legal Notice To Him Demanding Divorce, Maintenance", "description": "The notice was sent to the ", "image_url": "https://images.news18.com/ibnlive/uploads/2019/10/Nawazuddin-Siddiqui.jpg", "post_url": "https://www.news18.com/news/movies/actor-nawazuddin-siddiquis-wife-aaliya-sends-legal-notice-to-him-demanding-divorce-maintenance-2626035.html"}
{"page_title": "Lisa Haydon Celebrates Son Zack\u2019s 3rd Birthday With Homemade Cake And 'Spiderman' Surprise", "description": "Lisa Haydon took to Instagram to share some glimpses from the special day. In the pictures, we can spot a man wearing a Spiderman costume.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807960_lisa-rey.jpg", "post_url": "https://www.news18.com/news/movies/lisa-haydon-celebrates-son-zacks-3rd-birthday-with-homemade-cake-and-spiderman-surprise-2625953.html"}
{"page_title": "Chiranjeevi Recreates Old Picture with Wife, Says 'Time Has Changed'", "description": "Chiranjeevi was last seen in historical-drama Sye Raa Narasimha Reddy. He was shooting for his next film, Acharya, before the coronavirus lockdown.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589808242_pjimage.jpg", "post_url": "https://www.news18.com/news/movies/chiranjeevi-recreates-old-picture-with-wife-says-time-has-changed-2625973.html"}
{"page_title": "Amitabh Bachchan, Rishi Kapoor\u2019s Pout Selfie Recreated By Abhishek, Ranbir is Priceless", "description": "A throwback picture that has gone viral on the internet shows Ranbir Kapoor and Abhishek Bachchan recreating a selfie of their fathers Rishi Kapoor and Amitabh Bachchan.", "image_url": "https://images.news18.com/ibnlive/uploads/2020/05/1589807772_screenshot_20200518-184521_chrome_copy_875x583.jpg", "post_url": "https://www.news18.com/news/movies/amitabh-bachchan-rishi-kapoors-pout-selfie-recreated-by-abhishek-ranbir-is-priceless-2625867.html"}
Something that you can do is to create an annoter script wherein you can replace actor names with '###' or some other string (which will be replaced later with actor names (entities) for training).
I trained 68K data/sentences in 9 hrs with my i3 laptop. You can dump data like this and the output file can be used for training the model.
That will save time and also give you ready made training data format for SpaCy.
from nltk import word_tokenize
from pandas import read_csv
import re
import os.path
def annot(Label, entity, textlist) :
finaldict = []
for text_token in textlist:
textbk=text_token
for value in entity:
#if entity has multi tokens
text=textbk
text=text_token
text=str(text).replace('###',value)
text=text.lower()
text = re.sub('[^a-zA-Z0-9\n\.]',' ', text)
if len(word_tokenize(value))<2:
#print('I am here')
newtext=word_tokenize(text)
traindata=[]
prev_length=0
prev_pos=0
k=0
while k != len(newtext):
if k == 0:
prev_pos=0
prev_length=len(newtext[k])
if value.lower()== str(newtext[k]):
ent=Label
tup=(prev_pos,prev_length,ent)
traindata.append(tup)
else:
pass
else :
prev_pos=prev_length+1
prev_length=prev_length+len(newtext[k])+1
if value.lower()==str(newtext[k]):
ent=Label
tup=(prev_pos,prev_length,ent)
traindata.append(tup)
else:
pass
k=k+1
mydict={'entities':traindata}
finaldict.append((text,mydict))
else:
traindata=[]
try:
begin=text.index(value.lower())
ent=Label
tup=(begin,len(value.lower()),ent)
traindata.append(tup)
except ValueError:
pass
mydict={'entities':traindata}
finaldict.append((text,mydict))
return finaldict
def getEntities(csv_file, column) :
df = read_csv(csv_file)
return df[column].to_list()
def getSentences(file_name) :
with open(file_name) as file1 :
sentences = [line1.rstrip('\n') for line1 in file1]
return sentences
def saveData (data, filename, path) :
filename = os.path.join(path, filename)
with open(filename, 'a') as file :
for sent in data :
file.write("{}\n".format(sent))
ents = getEntities(csv_file, column_name) #Actor names in your case
entities = [ent for ent in ents if str(ent) != 'nan']
sentences = getSentences(filepathandname) #Considering you have the sentences in a text file
label = 'ACTOR_NAMES'
data = annot(label, entities, sentences)
saveData(data, 'train_data.txt', path)
Hope this is a relevant answer for your question.

Folium library error in choropleth

Am using folium library with an open data set from kaggle,
map.choropleth(geo_path=country_geo, data=plot_data,
columns=['CountryCode', 'Value'],
key_on='feature.id',
fill_color='YlGnBu', fill_opacity=0.7, line_opacity=0.2,
legend_name=hist_indicator
)
The above part of the code is giving me the following error:
TypeError: choropleth() got an unexpected keyword argument 'geo_path'
When I replace geo_path with geo_data I get this error:
JSONDecodeError: Expecting value: line 7 column 1 (char 6)
Is the issue related to "UCSanDiegoX: DSE200x Python for Data Science"? I took the advice of Cody and rename geo_path to geo_data at the specifications of map.choropleth.
At the git hub repository, take care of using the RAW data, which is in fact a file structured with the format GeoJSON. The first two lines should start like the code provided below
{"type":"FeatureCollection","features":[
{"type":"Feature","properties":{"name":"Afghanistan"},"geometry":
{"type":"Polygon","coordinates":[[[61.210817,35.650072],.....
geo_path doesn't work because it is not a parameter for choropleth. You are correct in replacing it with geo_data.
Your second error is likely due to non-existent or incorrectly formatted geojson file.
From http://python-visualization.github.io/folium/docs-master/modules.html?highlight=chor# your argument for geo_data needs to be a "URL, file path, or data (json, dict, geopandas, etc) to your GeoJSON geometries".
GeoJSON formatted files follow this structure from geojson.org:
{
"type": "Feature",
"geometry": {
"type": "Point",
"coordinates": [125.6, 10.1]
},
"properties": {
"name": "Dinagat Islands"
}
}

Resources