Not able to split dataset using Recommender Split in Azure ML Studio - Azure

I am building a crop recommender system using the Matchbox recommender in Azure ML Studio.
Dataset
When I split the data with the Recommender Split, the split does not actually happen: one output dataset receives all the records and the other is empty.
How can I overcome this?
This is my experiment setup.

I took the same dataset you provided and performed a Split Data step. This is my Split Data configuration:
Total records = 9
After visualizing Dataset1, number of records = 6
After visualizing Dataset2, number of records = 3
The data was split successfully. Please compare your Split Data configuration against mine.

Here is the link to the documentation for splitting data with the Recommender Split mode: Split Data. We also recommend reviewing the walkthrough provided with this sample experiment in the Azure AI Gallery: Movie Recommendation.
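For intuition, the 6/3 outcome above can be reproduced locally. Below is a rough pandas analogue of a recommender split (an illustration under assumed semantics, not the Studio implementation): hold out a fraction of each user's ratings as the second output and keep the rest in the first.

```python
import pandas as pd

# Toy user-item-rating triples, 9 records like the sample dataset above.
ratings = pd.DataFrame({
    "user":   ["u1", "u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3"],
    "item":   [f"i{k}" for k in range(9)],
    "rating": [5, 3, 4, 2, 5, 1, 4, 4, 3],
})

# Hold out one third of each user's ratings (per-user sampling, so every
# user keeps some history in the training output).
test = ratings.groupby("user", group_keys=False).sample(frac=1/3, random_state=0)
train = ratings.drop(test.index)

print(len(train), len(test))  # 6 3, matching the 6/3 split above
```

If both outputs come back 9/0 in Studio, the usual cause is a fraction of 0 or 1, or rating data that is not in the user/item/rating triple format the Recommender Split expects.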

Related

What is the importance of Azure ML dataset versioning?

I created an Azure ML dataset with a single file inside a storage blob container. Azure ML studio portal then showed 1 file in the dataset version 1.
I wanted to add 2 more files and create a new dataset version. So I copied 2 more files to the same blob container folder. Surprisingly even before I created a new dataset version, the ML studio portal UI shows the number of files in the same dataset as 3. (image below).
I then went through the Azure ML versioning docs, which explain that datasets are just references to the original data. I also saw a suggestion to create new folders for new data, and I agree that the new files were not copied to a new folder here as recommended.
But still, the metadata (e.g. files in dataset, total size of dataset etc) of a previously created dataset version is getting updated. What is the importance of Azure ML dataset versioning if metadata of dataset version itself is being updated?
A related question was asked on SO, but was closed as a bug.
Versioning makes model runs reproducible and comparable. With versioned data we can run the same prediction model against different versions of the dataset: the dataset keeps the same name, but each version can contain different values. This also supports running models in parallel against the same storage account.
We can create different AutoML prediction models with different versions of the dataset.
Both versions are uploaded to the same blob storage, and using any one of the versions I run the prediction model (classification).
The screenshot above shows churn_analysis running as the AutoML prediction model, with 25% of the dataset used for testing and 75% for training. The version of the dataset used in this prediction model is shown in the image below.
In the same manner, we can build prediction models with different versions of the training and test splits, and choose the model type for each version. We get different model results on a single dataset, which gives a better understanding of the data.
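To get the immutability the question is after, the docs' one-folder-per-snapshot suggestion can be combined with explicit version registration. A minimal sketch with the v1 azureml-core SDK (assumes an existing workspace config and default datastore; the dataset name and folder paths are illustrative):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Version 1: a folder holding only the original file. Because a dataset
# is a reference, any file later copied into this same folder would show
# up in version 1 too -- hence one folder per snapshot.
ds_v1 = Dataset.File.from_files(path=(datastore, "churn/v1/"))
ds_v1.register(ws, name="churn_data", create_new_version=True)

# Version 2: a new folder containing all three files, registered as the
# next version of the same dataset name.
ds_v2 = Dataset.File.from_files(path=(datastore, "churn/v2/"))
ds_v2.register(ws, name="churn_data", create_new_version=True)

# A run can then be pinned to an exact, reproducible snapshot.
pinned = Dataset.get_by_name(ws, name="churn_data", version=1)
```

With this layout the metadata of version 1 stays stable, because no new files ever land in its folder.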

Is it possible to keep training the same Azure Translate Custom Model with additional data sets?

I just finished training a custom Azure Translator model with a set of 10,000 sentences. I now have the option to review the results and test the data. While I already get a good result score, I would like to continue training the same model with additional data sets before publishing. I can't find any information regarding this in the documentation.
The only remotely close option I can see is to duplicate the first model and add the new data sets, but this would create a new model rather than advance the original one.
Once the project is created, we can train different models on different datasets. However, once a dataset is uploaded and the model has been trained, we cannot modify the content of the dataset or incrementally update the trained model.
https://learn.microsoft.com/en-us/azure/cognitive-services/translator/custom-translator/quickstart-build-deploy-custom-model
The above document can help you.

Using model for prediction in Vertex AI (Google Cloud Platform)

I am following a tutorial of Vertex AI on google cloud, based on colab (text classification):
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/automl/automl-text-classification.ipynb
and in particular I would like to understand how to use on GCP (Google Cloud Platform) a model trained in vertex-ai for thousands of predictions.
I think the relevant part of the tutorial for this topic is the section "Get batch predictions from your model". However, the method shown there involves producing one file per text and saving all of them in a bucket on Google Cloud Storage. These are the lines in the notebook where this is done:
instances = [
    "We hiked through the woods and up the hill to the ice caves",
    "My kitten is so cute",
]

input_file_name = "batch-prediction-input.jsonl"
...
# Instantiate the Storage client and get the bucket
# (renamed from `storage` to avoid shadowing the imported module)
storage_client = storage.Client()
bucket = storage_client.bucket(BUCKET_NAME)

# Iterate over the prediction instances, creating a new TXT file
# for each.
input_file_data = []
for count, instance in enumerate(instances):
    instance_name = f"input_{count}.txt"
    instance_file_uri = f"{BUCKET_URI}/{instance_name}"

    # Add the data to store in the JSONL input file.
    tmp_data = {"content": instance_file_uri, "mimeType": "text/plain"}
    input_file_data.append(tmp_data)

    # Create the new instance file
    blob = bucket.blob(instance_name)
    blob.upload_from_string(instance)

# Each JSONL line must be valid JSON, so serialize with json.dumps
# rather than str() (which would emit single quotes).
input_str = "\n".join([json.dumps(d) for d in input_file_data])
file_blob = bucket.blob(f"{input_file_name}")
file_blob.upload_from_string(input_str)
...
...
and after this they load the model and create a job:
job_display_name = "e2e-text-classification-batch-prediction-job"
model = aiplatform.Model(model_name=model_name)
batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
)
My question is this: is it really necessary to produce a single file for each sentence? If we have thousands of texts, this involves producing and saving thousands of small files on GCP. Doesn't this harm performance? Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?
I checked the tutorial, and the data used to train the model is per sentence. With regards to your questions:
Is it really necessary to produce one single file for each sentence?
It really depends on what you are predicting: sentences, paragraphs, etc. You can try passing a file with multiple sentences or paragraphs to test whether the trained model can handle it.
If you are not satisfied with the results, you can add more training data in multiple sentences or paragraphs (if that is your requirement), retrain the model, and test again until you are satisfied with the results.
If we have thousands of texts, this involves producing and saving thousands of small files on GCP. Doesn't this harm performance?
Based on the batch prediction documentation, it is possible to enable scaling when generating batch prediction jobs, which should address the performance concern.
If you use an autoscaling configuration, Vertex AI automatically scales your DeployedModel or BatchPredictionJob to use more prediction nodes when the CPU usage of your existing nodes gets high. Vertex AI scales your nodes based on CPU usage even if you have configured your prediction nodes to use GPUs; therefore if your prediction throughput is causing high GPU usage, but not high CPU usage, your nodes might not scale as you expect.
Here is a sample code on how to define enable scaling on batch prediction jobs using the code from the tutorial:
job_display_name = "e2e-text-classification-batch-prediction-job-scale"
model = aiplatform.Model(model_name=model_name)
batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
    machine_type="n1-highcpu-16",  # define the machine type
    starting_replica_count=1,
    max_replica_count=10,  # set max_replica_count > starting_replica_count to enable scaling
)
batch_prediction_job_name = batch_prediction_job.resource_name
I checked the generated logs and confirmed the scaling took effect.
Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?
I'm not familiar with TensorFlow model vectorization. It might be better to create a separate question for this so the community can contribute.

How to access the dataset transformed by the automatic featurization steps in Azure Automated ML

I'm performing a series of experiments with Azure AutoML and I need to see the featurized data. I mean not just the new feature names retrieved by get_engineered_feature_names(), or the featurization details retrieved by get_featurization_summary(), but the whole transformed dataset: the one obtained after scaling/normalization/featurization that is then used to train the models.
Is it possible to access this dataset or download it as a file?
Thanks.
A Microsoft expert confirmed that currently they "don't store the dataset from scaling/normalization/featurization after the run is complete". Answer here.
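If re-computing the transformation locally is acceptable, one unofficial workaround is to pull the fitted pipeline out of the best run and re-apply its featurization step yourself. A sketch with the v1 SDK (assumes an AutoMLRun named `remote_run` and the original training frame `X_train`; the step names are assumptions and may differ across task types and SDK versions):

```python
# Unofficial workaround sketch -- the featurized dataset is not a stored
# artifact of the run, so we re-derive it from the fitted model.
best_run, fitted_model = remote_run.get_output()

# fitted_model is a scikit-learn Pipeline; its first step is typically
# the automatic featurizer ("datatransformer" for classification and
# regression, "timeseriestransformer" for forecasting).
featurizer = (fitted_model.named_steps.get("datatransformer")
              or fitted_model.named_steps.get("timeseriestransformer"))

# Re-apply the fitted featurization to the raw training data; the result
# can then be saved to a file for inspection.
X_featurized = featurizer.transform(X_train)
```

This reproduces the transformed matrix only for the best model's pipeline, not for every model AutoML tried.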

Example for Azure AutoML Forecasting for time series with multiple covariate features

I would like to use Azure AutoML for forecasting where I have multiple features for one time series. Is there an example I can replicate?
I have been looking into: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-beer-remote/auto-ml-forecasting-beer-remote.ipynb
and
https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb
but had no luck using multiple features rather than a single time series.
Any help is greatly appreciated.
It looks like you are trying to find a notebook that shows how to predict a target variable when exogenous features are provided. The OJ sample notebook you included is actually a good example to reference for this scenario.
On a second glance, you'll find that in the OJ sample, 'Quantity' is a function of 'Price' and other variables. We suggest focusing on a single time series within the OJ dataset (a single store & brand combination), as the concept can get lost in the focus on multiple series. Also note that in this example the OJ dataset does have multiple features; we only specify which features need to be excluded.
OJ Sample Notebook: https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-orange-juice-sales/auto-ml-forecasting-orange-juice-sales.ipynb
-Sabina, Azure Machine Learning PM
Please check here:
Auto-train a time-series forecast model - Azure Machine Learning | Microsoft Docs
Also check the many-models accelerator below, which models time-series data (in a different domain) and can be useful:
buswrecker/energy-many-models: An offshoot of the original AML Many-Models - for the Energy Sector (github.com)
AML AutoML forecasting models handle missing data in the featurization stage via forward fill if the missing value is in the target column, or the median value if it is in a feature column. Libraries such as Prophet, which AutoML also supports, can be robust to missing data as well.
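For reference, exogenous features in an AutoML forecasting run need no special flag: every column left in the training data besides the time and target columns is treated as a covariate. A hedged sketch with the v1 SDK (column names and settings are illustrative; `train_df` is an assumed pandas frame with the target plus covariates such as price and promo):

```python
from azureml.automl.core.forecasting_parameters import ForecastingParameters
from azureml.train.automl import AutoMLConfig

# Forecast-specific settings: which column is time, and how far ahead
# to predict.
forecasting_parameters = ForecastingParameters(
    time_column_name="week_start",   # illustrative column name
    forecast_horizon=4,
)

# All remaining columns in train_df (e.g. "price", "promo") are used as
# covariate features automatically.
automl_config = AutoMLConfig(
    task="forecasting",
    training_data=train_df,
    label_column_name="quantity",    # the target series
    primary_metric="normalized_root_mean_squared_error",
    n_cross_validations=3,
    forecasting_parameters=forecasting_parameters,
)
```

This mirrors what the OJ notebook does; there the multi-feature behavior is easy to miss because the config only lists columns to exclude.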
