Using model for prediction in Vertex AI (Google Cloud Platform) - python-3.x

I am following a Vertex AI tutorial on Google Cloud, based on a Colab notebook (text classification):
https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/master/notebooks/official/automl/automl-text-classification.ipynb
and in particular I would like to understand how to use a model trained in Vertex AI on GCP (Google Cloud Platform) for thousands of predictions.
I think the relevant part of the tutorial for this topic is the section "Get batch predictions from your model". However, the method shown there involves producing a separate file for each single text and saving all of them in a bucket on Google Cloud Storage. These are the lines of the notebook where this is done:
instances = [
    "We hiked through the woods and up the hill to the ice caves",
    "My kitten is so cute",
]

input_file_name = "batch-prediction-input.jsonl"
...
# Instantiate the Storage client and create the new bucket
storage = storage.Client()
bucket = storage.bucket(BUCKET_NAME)

# Iterate over the prediction instances, creating a new TXT file
# for each.
input_file_data = []
for count, instance in enumerate(instances):
    instance_name = f"input_{count}.txt"
    instance_file_uri = f"{BUCKET_URI}/{instance_name}"

    # Add the data to store in the JSONL input file.
    tmp_data = {"content": instance_file_uri, "mimeType": "text/plain"}
    input_file_data.append(tmp_data)

    # Create the new instance file
    blob = bucket.blob(instance_name)
    blob.upload_from_string(instance)

input_str = "\n".join([str(d) for d in input_file_data])
file_blob = bucket.blob(f"{input_file_name}")
file_blob.upload_from_string(input_str)
...
and after this they load the model and create a job:
job_display_name = "e2e-text-classification-batch-prediction-job"

model = aiplatform.Model(model_name=model_name)

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
)
My question is this: is it really necessary to produce a separate file for each sentence? If we have thousands of texts, this means producing and saving thousands of small files on GCP. Doesn't this harm performance? Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?

I checked the tutorial you provided, and the data used to train the model is per sentence. With regard to your questions:
Is it really necessary to produce a single file for each sentence?
It really depends on what you are predicting: sentences, paragraphs, etc. You can try passing a file that contains multiple sentences or paragraphs to test whether the trained model can handle it.
If you are not satisfied with the results, you can add more training data consisting of multiple sentences or paragraphs (if that is your requirement), retrain the model, and then test again until you are satisfied.
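For example, a minimal sketch of that test, reusing BUCKET_NAME and BUCKET_URI from the notebook; the file name and the combined text below are made up for illustration:
import json
from google.cloud import storage

bucket = storage.Client().bucket(BUCKET_NAME)  # BUCKET_NAME as defined earlier in the notebook

# One file holding several sentences instead of one file per sentence.
multi_text = (
    "We hiked through the woods and up the hill to the ice caves. "
    "My kitten is so cute."
)
bucket.blob("input_multi_0.txt").upload_from_string(multi_text)

# The JSONL input still lists one {content, mimeType} entry per document.
line = json.dumps({"content": f"{BUCKET_URI}/input_multi_0.txt", "mimeType": "text/plain"})
bucket.blob("batch-prediction-input-multi.jsonl").upload_from_string(line)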
If we have thousands of texts, this involves producing and saving thousands of small files on GCP. Doesn't this harm performance?
Based on the batch prediction documentation, it is possible to enable scaling when creating batch prediction jobs, which should address the performance concern:
If you use an autoscaling configuration, Vertex AI automatically scales your DeployedModel or BatchPredictionJob to use more prediction nodes when the CPU usage of your existing nodes gets high. Vertex AI scales your nodes based on CPU usage even if you have configured your prediction nodes to use GPUs; therefore if your prediction throughput is causing high GPU usage, but not high CPU usage, your nodes might not scale as you expect.
Here is sample code showing how to enable scaling on batch prediction jobs, based on the code from the tutorial:
job_display_name = "e2e-text-classification-batch-prediction-job-scale"

model = aiplatform.Model(model_name=model_name)

batch_prediction_job = model.batch_predict(
    job_display_name=job_display_name,
    gcs_source=f"{BUCKET_URI}/{input_file_name}",
    gcs_destination_prefix=f"{BUCKET_URI}/output",
    sync=True,
    machine_type='n1-highcpu-16',  # define machine type
    starting_replica_count=1,
    max_replica_count=10,  # set max_replica_count > starting_replica_count to enable scaling
)

batch_prediction_job_name = batch_prediction_job.resource_name
I checked the generated logs and the scaling configuration took effect.
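If you prefer to verify from the SDK rather than the logs, a minimal sketch would be the following (since sync=True, batch_predict only returns once the job has finished):
# Inspect the finished job from the same SDK object.
print(batch_prediction_job.state)  # e.g. JobState.JOB_STATE_SUCCEEDED

# Iterate over the result files written under {BUCKET_URI}/output.
for blob in batch_prediction_job.iter_outputs():
    print(blob.name)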
Moreover, is the model still able to process the input "in batches", like a usual TensorFlow model, taking advantage of vectorization?
I'm not familiar with TensorFlow model vectorization. It might be better to create a separate question for this so the community can contribute.

Related

Using torch.distributed modules on AWS instance to parallelise model training by splitting model

I am wondering how to do model parallelism using PyTorch's distributed modules. Basically, what I want to do is the following:
class LargeModel(nn.Module):
    def __init__(self, in_features, n_hid, out_features) -> None:
        super().__init__()
        self.to_train_locally = nn.Linear(in_features, n_hid)
        self.to_train_on_aws = nn.Linear(n_hid, out_features)

    def forward(self, input):
        intermediate = self.to_train_locally(input)
        res = send_to_aws_for_forward_pass(self.to_train_on_aws, intermediate)
        return res
Basically, I want to train a "large" model split into two parts: a "local" component, self.to_train_locally, which can consist of an arbitrary number of layers and which I want to reside on my personal laptop, and a second component, self.to_train_on_aws, which I want to reside and train on an AWS EC2 instance that I have. This second component will be much bigger than the self.to_train_locally component. In this toy example, of course, the entire model could be stored locally, but that is not the point - I want a framework that will allow me to train part of the model locally while doing the bulk of the training on AWS.
How would I set up a training routine for this? I have looked at the following tutorials from PyTorch's official documentation, but none of them are of any help:
https://pytorch.org/tutorials/intermediate/dist_tuto.html This talks about gloo etc. and coordination tools that I assume make more sense for larger teams. If I want to do this for a personal project, where some layers exist and train on my local laptop and others on a personal EC2 instance, how would I do this?
https://pytorch.org/tutorials/intermediate/rpc_tutorial.html This is not useful either, because it talks about model parallelism on a single machine, where multiple processes are spawned/forked, but this is not what I want to do. For example, I can't just set
os.environ['MASTER_ADDR'] = 'my:aws:instance:public:ip:addr'
os.environ['MASTER_PORT'] = '29500' # any port
from the RNN example given in this tutorial.
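For reference, a minimal sketch (not from the tutorial) of what the same init mechanism looks like when the two ends are different machines; the address is a placeholder, and it assumes the EC2 security group exposes the chosen port and both machines can reach MASTER_ADDR:
import os
import torch.distributed.rpc as rpc

os.environ["MASTER_ADDR"] = "203.0.113.10"  # placeholder: public/VPN IP of the EC2 instance
os.environ["MASTER_PORT"] = "29500"         # any open port

# On the laptop (rank 0):
rpc.init_rpc("local_worker", rank=0, world_size=2)
# ... obtain an RRef to the remote module and call it via rpc.rpc_sync / rpc.remote ...
rpc.shutdown()

# On the EC2 instance, a second script joins as the other end:
# rpc.init_rpc("aws_worker", rank=1, world_size=2)
# rpc.shutdown()  # blocks until outstanding RPCs finish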
Has anyone tried to do something like this before? How would distributed autograd and a distributed optimiser work in this case in a training loop? Any help will be appreciated, especially if you can point me to code where someone has tried to do something like this before.
EDIT 1
What I am trying to do is basically explained in this paper https://arxiv.org/pdf/1903.11314.pdf, under the title "Model parallelism" -
"In model parallelism, the DL model is split, and each worker loads a different part of the DL model for training (see Figure 5). The worker(s) that hold the input layer of the DL model are fed with the training data. In the forward pass, they compute their output signal
which is propagated to the workers that hold the next layer of the DL model. In the backpropagation
pass, gradients are computed starting at the workers that hold the output layer of the DL model,
propagating to the workers that hold the input layers of the DL model."
Does anyone have an example of how to do this on AWS and using PyTorch?

Spark- The purpose of saving ALS model

I'm trying to understand what the purpose of storing an ALS model would be, and what a use case for a stored model would look like.
I have a dataset with over 300M rows, and I'm using a Hadoop cluster and Spark to calculate recommendations based on the ALS algorithm.
The whole computation takes around 5h, and I'm wondering what the case would be for storing my model and using it, for example, the next day, and... I don't see any. So either I'm doing something wrong (which is possible, taking into account the fact that I'm a beginner in the ML world), or the ALS algorithm in Spark and the possibility of saving it to disk are not very helpful.
Right now, I use it as following:
df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(df_input)
df_recommendations = model.recommendForAllUsers(10)
And as I mentioned, df_input is a DataFrame which contains over 300M rows. The total calculation time is around 5h, and after that I receive 10 recommended items for each user in the dataset.
In many tutorials and books there is an example of training the model and validating it with test data. Something like:
df_input = spark.read.format("avro").load(PATH, schema=SCHEMA)
(training, test) = df_input.randomSplit(weights=[0.7, 0.3])

als = ALS(maxIter=12, regParam=0.05, rank=15, userCol="user", itemCol="item", ratingCol="rating", coldStartStrategy="drop")
model = als.fit(training)
model.write().save("saved_model")
...
model = ALSModel.load('saved_model')
predictions = model.transform(test)  # or df_input to get predictions for each user
I don't see any pros of using it in such a way. However, I see one big con: you don't use 30% of the data to train the model.
As far as I know, there isn't a way to use an ALS model online (in real time), at least not without an external package/library.
You can't incrementally update this model.
You can't use it for newly registered users, because they don't exist in the stored matrix factorization, so there won't be any recommendations for them.
All you can do is check what the prediction would be for a given user-item pair, which is basically the same thing that would be returned in the first example of code (which used the fit() method).
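For illustration, a minimal sketch of that check with a loaded model; the user/item ids are hypothetical, and the column names match the userCol/itemCol used above:
from pyspark.ml.recommendation import ALSModel

model = ALSModel.load("saved_model")
# Hypothetical user/item ids to score.
pair = spark.createDataFrame([(42, 7)], ["user", "item"])
model.transform(pair).show()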
What would be a reason to store this model on disk and load it when needed? Or when (what conditions should be met) should I consider storing the model and reusing it? Could you provide a use case?

combine multiple spacy textcat_multilabel models into a single textcat_multilabel model

Problem: I have millions of records that need to be transformed using a bunch of spacy textcat_multilabel models.
# pseudocode
for model in models:
    nlp = spacy.load(model)
    for groups_of_records in records:  # millions of records
        new_data = nlp.pipe(groups_of_records)  # data is processed in bulk
        # process data
        bulk_create_records(new_data)
My current loop is as follows:
load a model
loop through records / transform data using model / save
As you can imagine, the more records I process and the more models I include, the longer this entire process will take. The idea is to make a single model and process my data just once, instead of (n * num_of_models) times.
Question: is there a way to combine multiple textcat_multilabel models created from the same spacy config, into a single textcat_multilabel model?
There is no basic feature to just combine models, but there are a couple of ways you can do this.
One is to source all your components into the same pipeline. This is very easy to do; see the double NER project for an example, and the short sketch after these options. The disadvantage is that this might not save you much processing time, since separately trained models will still have their own tok2vec layers.
You could combine your training data and train one big model. But if your models are actually separate that would almost certainly cause a reduction in accuracy.
If speed is the primary concern, you could train each of your textcats separately while freezing your tok2vec. That would result in decreased accuracy, though maybe not too bad, and it would allow you to then combine the textcat models in the same pipeline while removing a bunch of tok2vec processing. (This is probably the method I've listed with the best balance of implementation complexity, speed advantage, and accuracy sacrificed.)
One thing that I don't think has been tested is that you could try training separate textcat models at the same time with separate sets of labels by manually specifying the labels to each component in their configs. I am not completely sure that would work but you could try it.
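To illustrate the sourcing option, here is a minimal sketch; the pipeline names are hypothetical, and it assumes the pipelines were trained with compatible vocab/vectors:
import spacy

# Hypothetical names of the separately trained pipelines.
nlp = spacy.load("model_a")
nlp_b = spacy.load("model_b")

# Copy model_b's textcat_multilabel component (with its weights) into nlp
# under a distinct name; each component still runs its own tok2vec.
nlp.add_pipe("textcat_multilabel", source=nlp_b, name="textcat_multilabel_b")

doc = nlp("Some text to classify")
print(doc.cats)  # labels from both components end up in doc.cats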

Train multiple models with various measures and accumulate predictions

So I have been playing around with Azure ML lately, and I have one dataset with multiple values I want to predict. They all use different algorithms, and when I try to train multiple models within one experiment, it says the “train model can only predict one value”, and there are not enough input ports on the Train Model module to take in multiple values even if I were to use the same algorithm for each measure. I tried launching the column selector and making rules, but I get the same error as mentioned. How do I predict multiple values and later put the predicted columns together for the web service output, so I don’t have to have multiple APIs?
What you would want to do is train each model and save them as already-trained models.
So create a new experiment, train your models, and save them by right-clicking on each model; they will show up in the left nav bar in the Studio. Now you are able to drag your models into the canvas and have them score predictions, eventually making them end up in the same output, as I have done in my example through the “Add Columns” module. I made this example for Ronaldo (Real Madrid CF player) on how he will perform in a match after training day. You can see my demo at http://ronaldoinform.azurewebsites.net
For a more detailed explanation of how to save the models and train multiple values, you can check out the answer from Raymond Langaeian (MSFT) in the comment section of this link:
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-convert-training-experiment-to-scoring-experiment/
You have to train a model for each variable that you are going to predict, then add all those predicted columns together to get a single output for the web service.
The algorithms available in Azure ML are only capable of predicting a single variable at a time based on the inputs they receive.

Incremental training of ALS model

I'm trying to find out if it is possible to have "incremental training" on data using MLlib in Apache Spark.
My platform is Prediction IO, and it's basically a wrapper for Spark (MLlib), HBase, ElasticSearch and some other Restful parts.
In my app data "events" are inserted in real-time, but to get updated prediction results I need to "pio train" and "pio deploy". This takes some time and the server goes offline during the redeploy.
I'm trying to figure out if I can do incremental training during the "predict" phase, but cannot find an answer.
I imagine you are using Spark MLlib's ALS model, which performs matrix factorization. The result of the model is two matrices: a user-features matrix and an item-features matrix.
Assuming we receive a stream of data with ratings (or transactions, in the implicit case), a truly (100%) online update of this model would mean updating both matrices for each new piece of rating information that arrives, by triggering a full retrain of the ALS model on the entire data plus the new rating. In this scenario one is limited by the fact that running the entire ALS model is computationally expensive, and the incoming stream of data could be frequent, so it would trigger a full retrain too often.
So, knowing this, we can look for alternatives: a single rating should not change the matrices much, and we have optimization approaches which are incremental, for example SGD. There is an interesting (still experimental) library, written for the case of explicit ratings, which does incremental updates for each batch of a DStream:
https://github.com/brkyvz/streaming-matrix-factorization
The idea of using an incremental approach such as SGD rests on the fact that as long as one moves along the (negative) gradient (it is a minimization problem), one is guaranteed to be moving towards a minimum of the error function. So even if we do an update for just the single new rating, touching only the user-features row for this specific user and the item-features row for the specific item rated, and the update follows the gradient, we still move towards the minimum, of course as an approximation, but towards the minimum nonetheless.
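For concreteness, a minimal NumPy sketch of that per-rating SGD step on the two factor rows; the learning rate and regularization values are made up:
import numpy as np

def sgd_update(p_u, q_i, r_ui, lr=0.01, reg=0.1):
    """One SGD step for a single observed rating r_ui.

    p_u: user factor row, q_i: item factor row (both 1-D arrays).
    Only these two rows of the factor matrices are touched.
    """
    err = r_ui - p_u @ q_i                        # prediction error
    p_u_new = p_u + lr * (err * q_i - reg * p_u)  # move along the negative gradient
    q_i_new = q_i + lr * (err * p_u - reg * q_i)
    return p_u_new, q_i_new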
The other problem comes from Spark itself and the distributed setting: ideally the updates should be done sequentially, for each new incoming rating, but Spark treats the incoming stream as a batch, which is distributed as an RDD, so the update operations are done for the entire batch with no guarantee of sequentiality.
In more detail, if you are using Prediction.IO for example, you could do an offline training which uses the regular built-in train and deploy functions, but if you want to have online updates you will have to access both matrices for each batch of the stream, run updates using SGD, and then ask for the new model to be deployed. This functionality is of course not in Prediction.IO; you would have to build it on your own.
Interesting notes for SGD updates:
http://stanford.edu/~rezab/classes/cme323/S15/notes/lec14.pdf
For updating your model near-online (I write "near" because, let's face it, a true online update is impossible), you can use the fold-in technique, e.g.:
Online-Updating Regularized Kernel Matrix Factorization Models for Large-Scale Recommender Systems.
Or you can look at the code of:
MyMediaLite
Oryx - a framework built with the Lambda Architecture paradigm. It should have updates with fold-in of new users/items.
This is part of my answer to a similar question where both problems, near-online training and handling new users/items, were mixed.
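For reference, a minimal NumPy sketch of the fold-in step for a new user, given the item-features matrix of an already-trained model; the regularization value is made up:
import numpy as np

def fold_in_user(Q, rated_item_idx, ratings, reg=0.1):
    """Estimate factors for a new user without retraining the whole model.

    Q: item-features matrix (n_items x k) from the trained ALS model.
    rated_item_idx: indices of the items the new user has rated.
    ratings: the corresponding rating values.
    Solves a small regularized least-squares problem for the user's k factors.
    """
    Q_sub = Q[rated_item_idx]                    # (n_rated x k)
    k = Q.shape[1]
    A = Q_sub.T @ Q_sub + reg * np.eye(k)
    b = Q_sub.T @ np.asarray(ratings, dtype=float)
    return np.linalg.solve(A, b)                 # the new user's factor row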
