Is there a way to re-train an Azure Custom Speech model with additional text data? At the moment the only way seems to be to create a new text data source with the old data plus the new data and train a new model every time.
My question is about using a text file (related text), not audio + text. The text file greatly improves recognition accuracy but has a 200 MB limit. While this limit may be enough to start with, it will quickly become too small, especially for an enterprise application.
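For what it's worth, a workaround that at least avoids re-uploading a merged file is to upload the new related text as its own dataset and reference both datasets when training the next model (the 200 MB limit still applies per data file). Below is a rough, unverified sketch against the Speech to Text v3.0 REST API; the region, key and dataset IDs are placeholders, and whether multiple text datasets can be attached to one model should be checked against the current docs.
# Rough sketch (not verified end to end): create a new Custom Speech model that
# references both the existing and the newly uploaded plain-text datasets,
# instead of merging the files locally. Region, key and dataset IDs are placeholders.
import requests

SPEECH_KEY = "<your-speech-key>"                                              # placeholder
ENDPOINT = "https://<region>.api.cognitive.microsoft.com/speechtotext/v3.0"   # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": SPEECH_KEY, "Content-Type": "application/json"}

old_dataset = f"{ENDPOINT}/datasets/<old-dataset-id>"   # already-uploaded related text
new_dataset = f"{ENDPOINT}/datasets/<new-dataset-id>"   # newly uploaded related text

body = {
    "displayName": "model-with-old-plus-new-text",
    "locale": "en-US",
    # Reference both datasets rather than re-uploading old + new as one file.
    "datasets": [{"self": old_dataset}, {"self": new_dataset}],
}

response = requests.post(f"{ENDPOINT}/models", headers=HEADERS, json=body)
response.raise_for_status()
print("Created model:", response.json()["self"])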
Related
I just finished training an Azure Custom Translator model with a set of 10,000 sentences. I now have the options to review the result and test the data. While I already get a good result score, I would like to continue training the same model with additional data sets before publishing. I can't find any information regarding this in the documentation.
The only remotely close option I can see is to duplicate the first model and add the new data sets, but this would create a new model rather than advance the original one.
Once the project is created, we can train different models on different datasets. Once a dataset is uploaded and a model has been trained on it, we cannot modify the content of the dataset or update it.
https://learn.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model
The above document can help you.
I am following a Vertex AI tutorial on Google Cloud, based on a Colab notebook (text classification):
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/automl/automl-text-classification.ipynb
In particular, I would like to understand how to use a model trained in Vertex AI on GCP (Google Cloud Platform) for thousands of predictions.
I think the interesting part of the tutorial on this topic is the section "Get batch predictions from your model". However, the method shown there involves producing a bunch of files, one for each single text, and saving all of them in a bucket on Google Cloud Storage. These are the lines in the notebook where this is done:
instances = [
    "We hiked through the woods and up the hill to the ice caves",
    "My kitten is so cute",
]

input_file_name = "batch-prediction-input.jsonl"

...

# Instantiate the Storage client and create the new bucket
storage = storage.Client()
bucket = storage.bucket(BUCKET_NAME)

# Iterate over the prediction instances, creating a new TXT file
# for each.
input_file_data = []
for count, instance in enumerate(instances):
    instance_name = f"input_{count}.txt"
    instance_file_uri = f"{BUCKET_URI}/{instance_name}"

    # Add the data to store in the JSONL input file.
    tmp_data = {"content": instance_file_uri, "mimeType": "text/plain"}
    input_file_data.append(tmp_data)

    # Create the new instance file
    blob = bucket.blob(instance_name)
    blob.upload_from_string(instance)

input_str = "\n".join([str(d) for d in input_file_data])
file_blob = bucket.blob(f"{input_file_name}")
file_blob.upload_from_string(input_str)

...
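One detail worth noting about the snippet above: str(d) writes each entry as a Python dict literal (single quotes) rather than strict JSON. If you rebuild this input file yourself and the JSONL parser is strict, json.dumps is the safer choice; a minimal sketch reusing the same variables:
import json

# Write one strict-JSON object per line instead of the Python repr of each dict.
input_str = "\n".join(json.dumps(d) for d in input_file_data)
file_blob = bucket.blob(input_file_name)
file_blob.upload_from_string(input_str, content_type="application/json")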
After creating the input file, they load the model and create a batch prediction job:
job_display_name = "e2e-text-classification-batch-prediction-job"

model = aiplatform.Model(model_name=model_name)

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
)
My question is this: is it really necessary to produce one single file for each sentence? If we have thousands of texts, this involves producing and saving thousands of small files on GCP. Doesn't this harm performance? Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?
I checked the tutorial provided, and the data used to train the model is per sentence. With regard to your questions:
Is it really necessary to produce one single file for each sentence?
It really depends on what you are predicting: sentences, paragraphs, etc. You can try passing a file with multiple sentences or paragraphs to test whether the trained model can handle it.
If you are not satisfied with the results, you can add more training data in multiple sentences or paragraphs (if that is your requirement), retrain the model, and then test again until you are satisfied with the results.
If we have thousands of texts, this involves producing and saving thousands of small files on GCP. Doesn't this harm performance?
Based on the batch prediction documentation, it is possible to enable scaling when generating batch prediction jobs, so this should address the performance concern.
If you use an autoscaling configuration, Vertex AI automatically scales your DeployedModel or BatchPredictionJob to use more prediction nodes when the CPU usage of your existing nodes gets high. Vertex AI scales your nodes based on CPU usage even if you have configured your prediction nodes to use GPUs; therefore if your prediction throughput is causing high GPU usage, but not high CPU usage, your nodes might not scale as you expect.
Here is sample code showing how to enable scaling on batch prediction jobs, using the code from the tutorial:
job_display_name = "e2e-text-classification-batch-prediction-job-scale"

model = aiplatform.Model(model_name=model_name)

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
    machine_type="n1-highcpu-16",  # define machine type
    starting_replica_count=1,
    max_replica_count=10,  # set max_replica_count > starting_replica_count to enable scaling
)

batch_prediction_job_name = batch_prediction_job.resource_name
I checked the generated logs and the scaling configuration took effect.
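If you prefer to confirm this from code rather than from the console logs, the job object returned by batch_predict exposes its state and outputs. A minimal sketch, assuming the batch_prediction_job object from the snippet above and the google-cloud-aiplatform SDK (double-check the attribute names against your SDK version):
# Minimal sketch: inspect the finished batch prediction job from code.
# With sync=True, batch_predict() returns only after the job has completed.
print("State:", batch_prediction_job.state)          # e.g. JOB_STATE_SUCCEEDED
print("Output info:", batch_prediction_job.output_info)

# List the prediction result files written under the GCS output prefix.
for blob in batch_prediction_job.iter_outputs():
    print(blob.name)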
Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?
I'm not familiar with TensorFlow model vectorization. It might be better to create a separate question for this so the community can contribute.
I am building a crop recommender system using the Matchbox recommender in Azure ML Studio.
Dataset
When I split the data, it does not actually split: one output dataset gets all the records and the other ends up empty.
How can I overcome this?
This is what I have developed so far.
I took the same dataset you provided.
I performed Split Data.
This is my Split Data configuration:
Total records = 9
After visualizing Dataset1, number of records = 6
After visualizing Dataset2, number of records = 3
The data was split successfully.
You may want to compare your Split Data configuration with mine.
Here is the link to the documentation on Split Data using the Recommender Split mode. It recommends reviewing the walkthrough provided with this sample experiment in the Azure AI Gallery: Movie Recommendation.
I use the Microsoft Cognitive Services Face API for a face recognition project where users keep adding faces over time. Previously, the faces were stored in a "Face List". I am now shifting the faces to a "Large Face List". However, it requires a training call, which "Face Lists" did not require.
I am unable to find any documentation that mentions whether
we have to train it once, or
we have to train it every time a face is added.
It is not stated in the REST documentation for the Face API, but it is stated at the very beginning of the Face API's main documentation.
To enable Face search performance for Identification and FindSimilar in large scale, introduce a Train operation to preprocess the LargeFaceList and LargePersonGroup. The training time varies from seconds to about half an hour based on the actual capacity. During the training period, it's possible to perform Identification and FindSimilar if a successful training operation was done before. The drawback is that the newly added persons and faces don't appear in the result until a new post-migration-to-large-scale training is completed.
This means you need to train it every time faces are added. LargeFaceList is meant for large-scale use (up to 1,000,000 faces), so if you don't require that capacity, you might want to stick with FaceList (up to 1,000 faces), since it doesn't require training every time.
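For reference, the Train call itself is cheap to issue; a rough sketch against the Face REST API (endpoint, key and list ID are placeholders) that starts training on the LargeFaceList after faces have been added and polls until it finishes:
# Rough sketch: trigger (re)training of a LargeFaceList after adding faces,
# then poll the training status. Endpoint, key and list ID are placeholders.
import time
import requests

ENDPOINT = "https://<region>.api.cognitive.microsoft.com"  # placeholder
KEY = "<your-face-api-key>"                                # placeholder
LIST_ID = "my-large-face-list"                             # placeholder
HEADERS = {"Ocp-Apim-Subscription-Key": KEY}

# Start an asynchronous training pass over the whole list.
requests.post(
    f"{ENDPOINT}/face/v1.0/largefacelists/{LIST_ID}/train",
    headers=HEADERS,
).raise_for_status()

# Poll the training status until it is no longer queued or running.
while True:
    status = requests.get(
        f"{ENDPOINT}/face/v1.0/largefacelists/{LIST_ID}/training",
        headers=HEADERS,
    ).json()
    if status.get("status") not in ("notstarted", "running"):
        break
    time.sleep(5)

print("Training status:", status.get("status"))  # expected: "succeeded"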
I created a custom classifier by using this demo. Although I trained it on my two-class dataset, while testing the classifier on some images (test images not present in the training set) I get the error "The score for this image is not above the threshold of 0.5 based on the training data provided". How can I change this threshold in the scripts (JavaScript)?
For example, I'm fine with getting classification results for images with scores above 0.2.
To help you, first I recommend reading the best practices written by an IBM professional for getting better results and accuracy with Visual Recognition.
Regarding your question: this error comes from a condition inside the project written by the IBM developers. You can simply change the value at line #L270:
// Change this value
params.threshold = 0.5; // so the classifiers only show images with a confidence level of 0.5 or higher
Guidelines for training your Visual Recognition Classifiers.
API Reference for Visual Recognition using Node.js
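For comparison, if you call the service directly rather than through the Node.js demo, the threshold is just a request parameter on the v3 /classify endpoint. A hedged sketch using plain HTTP from Python; the service URL, API key and classifier ID are placeholders, and the authentication scheme depends on your service instance:
# Hedged sketch: call the Visual Recognition v3 classify endpoint with a lower
# threshold. Service URL, API key and classifier ID are placeholders.
import requests

API_KEY = "<your-visual-recognition-apikey>"                               # placeholder
SERVICE_URL = "https://gateway.watsonplatform.net/visual-recognition/api"  # placeholder

response = requests.get(
    f"{SERVICE_URL}/v3/classify",
    params={
        "version": "2018-03-19",
        "url": "https://example.com/some-image.jpg",      # image to classify
        "classifier_ids": "<your-custom-classifier-id>",  # placeholder
        "threshold": "0.2",  # return classes scoring 0.2 or higher
    },
    auth=("apikey", API_KEY),  # IAM API key as basic auth; adjust if your instance differs
)
print(response.json())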