Tensorboard + Keras + ML Engine - keras

I currently have Google Cloud ML Engine setup to train models created in Keras. When using Keras, it seems ML Engine does not automatically save the logs to a storage bucket. I see the logs in the ML Engine Jobs page but they do not show in my storage bucket and therefore I am unable to run tensorboard while training.
You can see the job completed successfully and produced logs:
But then there are no logs saved in my storage bucket:
I followed this tutorial when setting up my environment: (http://liufuyang.github.io/2017/04/02/just-another-tensorflow-beginner-guide-4.html)
So, how do I get the logs and run tensorboard when training a Keras model on ML Engine? Has anyone else had success with this?

You will need to create a callback keras.callbacks.TensorBoard(..) in order to write out the logs. See Tensorboad callback. You can supply GCS path as well (gs://path/to/my/logs) to the log_dir argument of the callback and then point Tensorboard to that location. You will add the callback as a list when calling model.fit_generator(...) or model.fit(...).
tb_logs = callbacks.TensorBoard(
log_dir='gs://path/to/logs',
histogram_freq=0,
write_graph=True,
embeddings_freq=0)
model.fit_generator(..., callbacks=[tb_logs])

Related

I can not register a model in my Azure ml experiment using run context

I am trying to register a model inside one of my azure ml experiments. I am able to register it via Model.register but not via run_context.register_model
This are the two code sentences I use. The commented one is the one that fails
learn.path = Path('./outputs').absolute()
Model.register(run_context.experiment.workspace, "outputs/login_classification.pkl","login_classification", tags=metrics)
run_context.register_model("login_classification", "outputs/login_classification.pkl", tags=metrics)
I received the next error:
Message: Could not locate the provided model_path outputs/login_classification.pkl
But model is stored in this path:
Before implementing run_context.register_model() implement run_context = Run.get_context()
I was able to fix the problem by explicitly uploading the model into the run history record before trying for registering the model.
run.upload_file("output/model.pickle", "output/model.pickle")
Check the documentation for Message: Could not locate the provided model_path outputs/login_classification.pkl
To check about Run Class

Testing the sagemaker endpoint deployment locally

How to test the endpoint deployment in sagemaker locally using the sagemaker notebook instance?
The issue is that if we want to test the endpoint using the sagemaker studio notebook then it will take some time before it spins up the docker inference container depending on the instance. This can certainly hamper the development and process cycle!
Create a LocalSession and configure it directly:
from sagemaker.local import LocalSession
sagemaker_session = LocalSession()
sagemaker_session.config = {'local': {'local_code': True}}
Now pass this sagemaker_session to your estimator or model

How to Deploy trained TensorFlow 2.0 models using Amazon SageMaker?

i am trying to deploy a custom trained tensorflow model using Amazon SageMaker. i have trained xlm roberta using tf 2.2.0 for multilingual sentiment analysis task.(please refer to this notebook : https://www.kaggle.com/mobassir/understanding-cross-lingual-models)
now, using trained weight file of my model i am trying to deploy that in sagemaker, i was following this tutorial : https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/
converted some keras code from there to tensorflow.keras for 2.2.0
but when i do : !ls export/Servo/1/variables i can see that export as Savedmodel generating empty variables directory like this : https://github.com/tensorflow/models/issues/1988
i can't find any documentation help for tf 2.2.0 trained model deployment
need example like this : https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/ for tf 2.x models and not keras
even though !ls export/Servo/1/variables shows empty directory but An endpoint was created successfully and now i am not sure if my model was deployed successfully or not because when i try to test the model deployment inside aws notebook by using predictor = sagemaker.tensorflow.model.TensorFlowPredictor(endpoint_name, sagemaker_session)
i.e. predictor.predict(data) i get the following error message:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from model with message "{
"error": "Session was not created with a graph before Run()!"
}"
related problem : Inference error with TensorFlow C++ on iOS: "Invalid argument: Session was not created with a graph before Run()!"
the code i tried can be found here : https://pastebin.com/sGuTtnSD

Unable to register an ONNX model in azure machine learning service workspace

I was trying to register an ONNX model to Azure Machine Learning service workspace in two different ways, but I am getting errors I couldn't solve.
First method: Via Jupyter Notebook and python Script
model = Model.register(model_path = MODEL_FILENAME,
model_name = "MyONNXmodel",
tags = {"onnx":"V0"},
description = "test",
workspace = ws)
The error is : HttpOperationError: Operation returned an invalid status code 'Service invocation failed!Request: GET https://cert-westeurope.experiments.azureml.net/rp/workspaces'
Second method: Via Azure Portal
Anyone can help please?
error 413 means the payload is too large. Using Azure portal, you can only upload a model upto 25MB in size. Please use python SDK to upload models larger than 25MB.

In Google Cloud Dataproc, where is all the log stored?

I have a PySpark job that I am distributing across a 1-master, 3-worker cluster.
I have some python print commands which help me debug my code.
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')
print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
Now, when I run the code on Google Dataproc with the master set as local, the print outputs correctly. However, when I try to run it on yarn, the print with YARN-based Spark, the print outputs do not appear in the Google Cloud Console under the jobs section of the Dataproc UI.
Where can I access these python print outputs from each of the workers and master which do not appear in the Google Dataproc Console
If you're using Dataproc, why to access the logs via Spark UI? The better way would be to:
Submit a job using gcloud dataproc jobs submit example
Once the job is submitted, you can access Cloud Dataproc job driver output using the Cloud Platform Console, the gcloud command, or Cloud Storage, as explained below.
The Cloud Platform Console allows you to view a job's realtime driver output. To view job output, go to your project's Cloud Dataproc Jobs section, then click on the Job ID to view job output.
Reference Documentation
If you really want to access to the YARN interface (with the detailed list of all the jobs and their logs), you can do the following :
Get the external ip address of your master. You can find it in the Cluster Details/VM instances in the UI.
http://img15.hostingpics.net/pics/386611Capturedecran20170403a162303.png
Just click on your master.
Connect to the URL : http://yourMastersExternalIpAddress:8088/cluster

Resources