I am creating a Question Answering model using simpletransformers. I would also like to use wandb to track model artifacts. As I understand from wandb docs, there is an integration touchpoint for simpletransformers but there is no mention of logging artifacts.
I would like to log artifacts generated at the train, validation, and test phase such as train.json, eval.json, test.json, output/nbest_predictions_test.json and best performing model.
Currently simpleTransformers doesn't support logging artifacts within the training/testing scripts. But you can do it manually:
import os
with wandb.init(id=model.wandb_run_id, resume="allow", project=wandb_project) as training_run:
for dir in sorted(os.listdir("outputs")):
if "checkpoint" in dir:
artifact = wandb.Artifact("model-checkpoints", type="checkpoints")
artifact.add_dir("outputs" + "/" + dir)
training_run.log_artifact(artifact)
For more info, you can follow along with the W&B notebook in the SimpleTransofrmer's README.md
Related
I am trying to create a pipeline with Python SDK v2 in Azure Machine Learning Studio. Been stuck on this error for many.. MANY.. hours now, so now I am reaching out.
I have been following this guide: https://learn.microsoft.com/en-us/azure/machine-learning/tutorial-pipeline-python-sdk
My setup is very similar, but I split "data_prep" into two separate steps, and I am using a custom ml model.
How the pipeline is defined:
`
# the dsl decorator tells the sdk that we are defining an Azure ML pipeline
from azure.ai.ml import dsl, Input, Output
import pathlib
import os
#dsl.pipeline(
compute=cpu_compute_target,
description="Car predict pipeline",
)
def car_predict_pipeline(
pipeline_job_data_input,
pipeline_job_registered_model_name,
):
# using data_prep_function like a python call with its own inputs
data_prep_job = data_prep_component(
data=pipeline_job_data_input,
)
print('-----------------------------------------------')
print(os.path.realpath(str(pipeline_job_data_input)))
print(os.path.realpath(str(data_prep_job.outputs.prepared_data)))
print('-----------------------------------------------')
train_test_split_job = traintestsplit_component(
prepared_data=data_prep_job.outputs.prepared_data
)
# using train_func like a python call with its own inputs
train_job = train_component(
train_data=train_test_split_job.outputs.train_data, # note: using outputs from previous step
test_data=train_test_split_job.outputs.test_data, # note: using outputs from previous step
registered_model_name=pipeline_job_registered_model_name,
)
# a pipeline returns a dictionary of outputs
# keys will code for the pipeline output identifier
return {
# "pipeline_job_train_data": train_job.outputs.train_data,
# "pipeline_job_test_data": train_job.outputs.test_data,
"pipeline_job_model": train_job.outputs.model
}
`
I managed to run every single component successfully, in order, via the command line and produced a trained model. Ergo the components and data works fine, but the pipeline won't run.
I can provide additional info, but I am not sure what is needed and I do not want to clutter the post.
I have tried googling. I have tried comparing the tutorial pipeline with my own. I have tried using print statements to isolate the issue. Nothing has worked so far. Nothing that I have done has changed the error either, it's the same error no matter what.
Edit:
Some additional info about my environment:
from azure.ai.ml.entities import Environment
custom_env_name = "pipeline_test_environment_pricepredict_model"
pipeline_job_env = Environment(
name=custom_env_name,
description="Environment for testing out Jeppes model in pipeline building",
conda_file=os.path.join(dependencies_dir, "conda.yml"),
image="mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest",
version="1.0",
)
pipeline_job_env = ml_client.environments.create_or_update(pipeline_job_env)
print(
f"Environment with name {pipeline_job_env.name} is registered to workspace, the environment version is {pipeline_job_env.version}"
)
Build status of environment. It had already run successfully.
In azure machine learning studio, when the application was running and model was deployed we have the default options to get the curated environments or custom environments. If the environment was created based on the existing deployment, we need to check with the build was successful or not.
Until we get the success in the deployment, we cannot get the environment variables noted into the program and we cannot retrieve the variables through the code block.
Select the environment need to be used.
Choose the existing version created.
We will get the mount location details and the docker file if creating using the docker and conda environment.
The environment and up and running successfully. If the case is running, then using the asset ID or the mount details we can retrieve the environment variables information.
/mnt/batch/tasks/shared/LS_root/mounts/clusters/workspace-name/code/files/docker/Dockerfile
I'm running a sklearn pipeline with hyperparameter search (let's say GridSearch). Now, I am logging artifacts such as test results and whole-dataset predictions. I'd like to retrieve these artifacts but the mlflow API is getting in the way...
import mlflow
mlflow.set_tracking_uri("sqlite:///mlruns/mlruns.db")
mlflow.set_registry_uri("./mlruns/")
run_ids = [r.run_id for r in mlflow.list_run_infos(mlflow.get_experiment_by_name("My Experiment").experiment_id)]
With the above code, I can retrieve all runs but I have no way of telling which one is a toplevel run with artifacts logged or a sub-run spawned by the GridSearch procedure.
Is there some way of querying only for parent runs, so I can retrieve these csv files in order to plot the results? I can of course go to the web api and manually select the run then copy the URI for the file, but I'd like to do it programmatically instead of opening a tab and clicking things.
I am using fastai with pytorch to fine tune XLMRoberta from huggingface.
I've trained the model and everything is fine on the machine where I trained it.
But when I try to load the model on another machine I get OSError - Not Found - No such file or directory pointing to .cache/torch/transformers/. The issue is the path of a vocab_file.
I've used fastai's Learner.export to export the model in .pkl file, but I don't believe that issue is related to fastai since I found the same issue appearing in flairNLP.
It appears that the path to the cache folder, where the vocab_file is stored during the training, is embedded in the .pkl file:
The error comes from transformer's XLMRobertaTokenizer __setstate__:
def __setstate__(self, d):
self.__dict__ = d
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(self.vocab_file)
which tries to load the vocab_file using the path from the file.
I've tried patching this method using:
pretrained_model_name = "xlm-roberta-base"
vocab_file = XLMRobertaTokenizer.from_pretrained(pretrained_model_name).vocab_file
def _setstate(self, d):
self.__dict__ = d
self.sp_model = spm.SentencePieceProcessor()
self.sp_model.Load(vocab_file)
XLMRobertaTokenizer.__setstate__ = MethodType(_setstate, XLMRobertaTokenizer(vocab_file))
And that successfully loaded the model but caused other problems like missing model attributes and other unwanted issues.
Can someone please explain why is the path embedded inside the file, is there a way to configure it without reexporting the model or if it has to be reexported how to configure it dynamically using fastai, torch and huggingface.
I faced the same error. I had fine tuned XLMRoberta on downstream classification task with fastai version = 1.0.61. I'm loading the model inside docker.
I'm not sure about why the path is embedded, but I found a workaround. Posting for future readers who might be looking for workaround as retraining is usually not possible.
I created /home/.cache/torch/transformer/ inside the docker image.
RUN mkdir -p /home/<username>/.cache/torch/transformers
Copied the files (which were not found in docker) from my local /home/.cache/torch/transformer/ to docker image /home/.cache/torch/transformer/
COPY filename:/home/<username>/.cache/torch/transformers/filename
I can successfully submit an experiment to processing on a remote compute target on Azure ML.
In my notebook, for submitting the experiment, I have:
# estimator
estimator = Estimator(
source_directory='scripts',
entry_script='exp01.py',
compute_target='pc2',
conda_packages=['scikit-learn'],
inputs=[data.as_named_input('my_dataset')],
)
# Submit
exp = Experiment(workspace=ws, name='my_exp')
# Run the experiment based on the estimator
run = exp.submit(config=estimator)
RunDetails(run).show()
run.wait_for_completion(show_output=True)
However, in order to keep things clean, I want to define my general use functions on an auxiliary script, so the first will import it.
On my script experiment file exp01.py, I wanted:
import custom_functions as custom
# azure experiment start
run = Run.get_context()
# the data from azure datasets/datastorage
df = run.input_datasets['my_dataset'].to_pandas_dataframe()
# prepare data
df_transformed = custom.prepare_data(df)
# split data
X_train, X_test, y_train, y_test = custom.split_data(df_transformed)
# run my models.....
model_name = 'RF'
model = custom.model_x(model_name, a_lot_of_args)
# log the results
run.log(model_name, results)
# azure finish
run.complete()
The thing is: Azure wont let me import the custom_functions.py.
How are you doing it?
TL;DR any files you put inside the source_directory in your case, scripts will be available to the Estimator.
To make this happen, simply create a file called custom_functions.py in the scripts folder that contains your prepare_data(), split_data(), model_x() functions.
I also recommend that you include only exactly what you need in the source_directory folder and make distinct folders for each Estimator because:
the entire folder's contents will be uploaded when you use a remote compute_target, and
when you started using ML Pipeilnes (which are awesome), PythonScriptSteps allow_reuse parameter will look to see if any files in the source_directory have changed when determining if the step needs to run again or not.
Lastly, when you want to share general utility functions across PythonScriptSteps or Estimators without having to copy and paste code, that's when you might want to consider creating a custom python package.
I have been following github repository for "Tensorflow on Android".
I was able to build the code using bazel and then import the Android project to Android Studio, as mentioned here.
As you can see here, after building the APK, using Android Studio, the Model files/Graphs are included in the tensorflow/examples/android/assets
By default, tensorflow_inception_graph.pb and imagenet_comp_graph_label_strings.txt are included, from inception5 file which is downloaded while the APK is built.
What's the issue?
I have a retrained graph (InceptionV3 model, mentioned in tensorflow/examples/image_retraining/retrain.py), which I was able to place in the assets folder in android directory and generate a working APK.
Inference time while I was using the default graph or .pb file was ~500ms and with my retrained.pb or graph it is ~1400ms.(tested on OnePlus3T device)
Please help me understand
How to analyze the default tensorflow_inception_graph.pb on Tensorboard
Last May they have introduced a helper script called import_pb_to_tensorboard to do just that.
usage: import_pb_to_tensorboard.py [-h] [--model_dir MODEL_DIR]
[--log_dir LOG_DIR]
optional arguments:
-h, --help show this help message and exit
--model_dir MODEL_DIR
The location of the protobuf ('pb') model to
visualize.
--log_dir LOG_DIR The location for the Tensorboard log to begin
visualization from.
Note that currently, the version in master seems to have received more love than the one present in the latest 1.2.1 distribution of tensorflow, so I would suggest to use this one.