I am trying to save three metrics (loss, validation loss, and mAP) at every epoch over runs of 100 and 50 epochs, but at the end of the experiment I get this error:
Run failed: RunHistory finalization failed: ServiceException: Code: 400 Message: (ValidationError) Metric Document is too large
I am using this code to save the metrics
run.log_list("loss", history.history["loss"])
run.log_list("val_loss", history.history["val_loss"])
run.log_list("val_mean_average_precision", history.history["val_mean_average_precision"])
I don't understand why trying to save only 3 metrics exceeds the limits of Azure ML Service.
You could break the run history list writes into smaller blocks like this:
run.log_list("loss", history.history["loss"][:N])
run.log_list("loss", history.history["loss"][N:])
Internally, the run history service concatenates the blocks with the same metric name into a contiguous list.
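For example, a minimal sketch of a helper that logs a long history in fixed-size chunks (the helper name and the chunk size of 250 are assumptions, not part of the SDK):

def log_list_chunked(run, name, values, chunk_size=250):
    # Split a long metric list into several log_list calls so that no single
    # run history document grows past the service's size limit.
    for start in range(0, len(values), chunk_size):
        run.log_list(name, values[start:start + chunk_size])

log_list_chunked(run, "loss", history.history["loss"])
log_list_chunked(run, "val_loss", history.history["val_loss"])
log_list_chunked(run, "val_mean_average_precision",
                 history.history["val_mean_average_precision"])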
Related
I want to perform a hyperparameter search using AzureML. My models are small (around 1 GB), so I would like to run multiple models on the same GPU/node to save costs, but I do not know how to achieve this.
The way I currently submit jobs is the following (resulting in one training run per GPU/node):
experiment = Experiment(workspace, experiment_name)
config = ScriptRunConfig(source_directory="./src",
                         script="train.py",
                         compute_target="gpu_cluster",
                         environment="env_name",
                         arguments=["--args args"])
run = experiment.submit(config)
ScriptRunConfig can be provided with a distributed_job_config. I tried to use MpiConfiguration there, but then the run fails with an MPI error that reads as if the cluster were configured to allow only one run per node:
Open RTE detected a bad parameter in hostfile: [...]
The max_slots parameter is less than the slots parameter:
slots = 3
max_slots = 1
[...] ORTE_ERROR_LOG: Bad Parameter in file util/hostfile/hostfile.c at line 407
Using HyperDriveConfig also defaults to submitting one run per GPU, and additionally providing an MpiConfiguration leads to the same error as shown above.
I guess I could always rewrite my training script to train multiple models in parallel, so that each run wraps multiple trainings. I would like to avoid this option, though, because logging and checkpoint writes become increasingly messy and it would require a large refactor of the training pipeline. Also, this functionality seems so basic that I hope there is a way to do it gracefully. Any ideas?
Use the Run.create_children method, which starts child runs that are “local” to the parent run and don't need authentication.
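For example, a minimal sketch run inside your train.py, assuming it executes within an Azure ML run context; the search space and the training step are placeholders, not your actual code:

from azureml.core import Run

parent_run = Run.get_context()

# Hypothetical search space: three configurations trained on the same GPU/node.
hyperparameter_sets = [{"lr": 1e-3}, {"lr": 1e-4}, {"lr": 1e-5}]

child_runs = parent_run.create_children(count=len(hyperparameter_sets))
for child, params in zip(child_runs, hyperparameter_sets):
    child.log("lr", params["lr"])
    # ... train one model with `params` here and log its metrics to `child` ...
    child.complete()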
For AmlCompute, max_concurrent_runs maps to the maximum number of nodes that will be used for a hyperparameter tuning run, so there would be one execution per node.
Another option: deploy a single service but load multiple model versions in init(); the score function then, depending on the request's parameters, uses a particular model version to score (see the sketch below).
Or use the new ML Endpoints (Preview):
What are endpoints (preview) - Azure Machine Learning | Microsoft Docs
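For the single-service option, a minimal sketch of the scoring script; the model name "my-model", the two versions, and the joblib format are assumptions for illustration:

import json

import joblib
from azureml.core.model import Model

models = {}

def init():
    # Load two registered versions of a hypothetical model side by side.
    for version in (1, 2):
        path = Model.get_model_path("my-model", version=version)
        models[str(version)] = joblib.load(path)

def run(raw_data):
    payload = json.loads(raw_data)
    version = payload.get("model_version", "1")  # the request picks the version
    prediction = models[version].predict(payload["data"])
    return prediction.tolist()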
I have a vast dataset composed of ~2.4 million JSON files that each contain several records. I've created a simple Apache Beam data pipeline (shown below) that follows these steps:
Read data from a GCS bucket using a glob pattern.
Extract records from JSON data.
Transform data: convert dictionaries to JSON strings, parse timestamps, others.
Write to BigQuery.
# Pipeline
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
p = beam.Pipeline(options=pipeline_options)
# Read
files = p | 'get_data' >> ReadFromText(files_pattern)
# Transform
output = (files
          | 'extract_records' >> beam.ParDo(ExtractRecordsFn())
          | 'transform_data' >> beam.ParDo(TransformDataFn()))
# Write
output | 'write_data' >> WriteToBigQuery(table=known_args.table,
                                         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                                         write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
                                         insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR',
                                         temp_file_format='NEWLINE_DELIMITED_JSON')
# Run
result = p.run()
result.wait_until_finish()
I've tested this pipeline with a minimal sample dataset and it works as expected. But I'm pretty doubtful about the optimal use of BigQuery resources and quotas. The batch load quotas are very restrictive, and given the massive number of files to parse and load, I want to know if I'm missing settings that could guarantee the pipeline will respect the quotas and run optimally. I don't want to exceed the quotas, as I am running other loads to BigQuery in the same project.
I haven't fully understood some parameters of the WriteToBigQuery() transform, specifically batch_size, max_file_size, and max_files_per_bundle, or whether they could help optimize the load jobs to BigQuery. Could you help me with this?
Update
I'm not only concerned about BigQuery quotas, but GCP quotas of other resources used by this pipeline are also a matter of concern.
I tried to run my simple pipeline over the target data (~2.4 million files), but I'm receiving the following warning message:
Project [my-project] has insufficient quota(s) to execute this workflow with 1 instances in region us-central1. Quota summary (required/available): 1/16 instances, 1/16 CPUs, 250/2096 disk GB, 0/500 SSD disk GB, 1/99 instance groups, 1/49 managed instance groups, 1/99 instance templates, 1/0 in-use IP addresses. Please see https://cloud.google.com/compute/docs/resource-quotas about requesting more quota.
I don't understand that message completely. The process activated 8 workers successfully and is using 8 of the 8 available in-use IP addresses. Is this a problem? How can I fix it?
If you're worried about load job quotas, you can try streaming data into BigQuery, which comes with a less restrictive quota policy.
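For example, a minimal sketch that switches the sink to streaming inserts, reusing the output PCollection and known_args.table from your pipeline; the WRITE_APPEND disposition and the batch_size value are assumptions:

from apache_beam.io.gcp.bigquery import WriteToBigQuery

output | 'write_data_streaming' >> WriteToBigQuery(
    table=known_args.table,
    method=WriteToBigQuery.Method.STREAMING_INSERTS,
    batch_size=500,  # rows per streaming insert request (illustrative value)
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    insert_retry_strategy='RETRY_ON_TRANSIENT_ERROR')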
To achieve what you want, you can try the Google-provided templates or just refer to their code:
Cloud Storage Text to BigQuery (Stream) [code]
Cloud Storage Text to BigQuery (Batch)
And last but not least, more detailed information can be found in the Google BigQuery I/O connector documentation.
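If you stay with batch load jobs instead, the parameters you asked about live on the same sink; a hypothetical sketch with purely illustrative values, again reusing output and known_args.table:

output | 'write_data_batched' >> WriteToBigQuery(
    table=known_args.table,
    method=WriteToBigQuery.Method.FILE_LOADS,
    max_file_size=256 * 1024 * 1024,  # illustrative cap on each staged file's size
    max_files_per_bundle=20,          # illustrative cap on files written concurrently per worker
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
    temp_file_format='NEWLINE_DELIMITED_JSON')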
When setting up a data input pipeline to Tensorflow (web cam images), a large amount of time is spent loading the data from the system RAM to the GPU memory.
I am trying to feed a constant stream of images (1024x1024) through my object detection network. I'm currently using a V100 on AWS to perform inference.
The first attempt was with a simple feed dict operation.
# Get layers
img_input_tensor = sess.graph.get_tensor_by_name('import/input_image:0')
img_anchors_input_tensor = sess.graph.get_tensor_by_name('import/input_anchors:0')
img_meta_input_tensor = sess.graph.get_tensor_by_name('import/input_image_meta:0')
detections_input_tensor = sess.graph.get_tensor_by_name('import/output_detections:0')
detections = sess.run(detections_input_tensor,
                      feed_dict={img_input_tensor: molded_image,
                                 img_meta_input_tensor: image_meta,
                                 img_anchors_input_tensor: image_anchor})
This produced inference times around 0.06 ms per image.
However, after reading the Tensorflow manual I noticed that the tf.data API was recommended for loading data for inference.
# setup data input
data = tf.data.Dataset.from_tensors((img_input_tensor, img_meta_input_tensor, img_anchors_input_tensor, detections_input_tensor))
iterator = data.make_initializable_iterator() # create the iterator
next_batch = iterator.get_next()
# load data
sess.run(iterator.initializer,
         feed_dict={img_input_tensor: molded_image,
                    img_meta_input_tensor: image_meta,
                    img_anchors_input_tensor: image_anchor})
# inference
detections = sess.run([next_batch])[0][3]
This sped up inference time to 0.01 ms, but the time taken to load the data was 0.1 ms. Overall this iterator method ends up significantly slower than the 'slower' feed_dict method. Is there something I can do to speed up the loading process?
Here is a great guide on data pipeline optimization. I personally find the .prefetch method the easiest way to boost your input pipeline. However, the article provides much more advanced techniques.
However, if your input data is not in TFRecords but you feed it yourself, you have to implement the described techniques (buffering, interleaved operations) yourself, as in the sketch below.
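A minimal sketch of adding prefetching with tf.data in TF 1.x style to match your setup; the frame generator here is a stand-in for your webcam capture code, not part of your pipeline:

import numpy as np
import tensorflow as tf

def load_frames():
    # Hypothetical generator standing in for the webcam capture loop;
    # replace the random array with your real preprocessed 1024x1024 frames.
    while True:
        yield np.random.rand(1024, 1024, 3).astype(np.float32)

dataset = (tf.data.Dataset
           .from_generator(load_frames, output_types=tf.float32,
                           output_shapes=(1024, 1024, 3))
           .batch(1)
           .prefetch(1))  # prepare the next frame while the current one is being processed

iterator = dataset.make_one_shot_iterator()
next_frame = iterator.get_next()  # feed this tensor into the detection graph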
I generated an LMDB database using the SSD-Caffe fork here. I have successfully generated the VOC LMDB trainval/test LMDB directories and am able to train the model.
However, during training it takes inordinately long to load data from the LMDB database. For example, when profiling with Caffe's time function using this command:
ssdcaffe time --model "jobs/VGGNet/VOC0712/SSD_300x300/train.prototxt" --gpu 0 --iterations 20
I get that the forward pass takes 8.9 s on average and the backward pass takes 0.5 s on average. On a layer-by-layer inspection, the data ingestion layer takes the bulk of that time at 8.7 s. See below:
I1129 10:14:11.094445 8011 caffe.cpp:404] data forward: 8660.38 ms.
...
I1129 10:14:11.095383 8011 caffe.cpp:412] Average Forward pass: 8933.31 ms.
I1129 10:14:11.095389 8011 caffe.cpp:414] Average Backward pass: 519.549 ms.
If I halve the batch size from 32 to 16, the data ingestion layer time decreases roughly by half:
I1129 10:20:07.975527 8093 caffe.cpp:404] data forward: 3906.53 ms.
This is clearly not the intended speed, and something is wrong. Any help would be greatly appreciated!
Found my issue:
My images were too big. The standard VOC images the repo uses are ~350x500 pixels, whereas my images were 1080x1920. When I downsized my images by 3x (i.e. 9x fewer pixels), my data ingestion layer took only 181 ms (a 48x speedup over the previous 8.6 s).
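For reference, a minimal sketch of that downscaling step, assuming OpenCV is available; the file paths are placeholders and you would loop it over the whole image set before regenerating the LMDB:

import cv2

img = cv2.imread("frames/frame_0001.jpg")             # original 1080x1920 frame
small = cv2.resize(img, None, fx=1.0 / 3, fy=1.0 / 3,
                   interpolation=cv2.INTER_AREA)       # ~360x640, i.e. ~9x fewer pixels
cv2.imwrite("frames_small/frame_0001.jpg", small)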
I am using Tensorflow's Object Detection API to detect cars. It should detect the cars as one class "car".
I followed sentdex's following series:
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
System information:
OS - Ubuntu 18.04 LTS
GPU - Nvidia 940M (VRAM : 2GB)
Tensorflow : 1.10
Python - 3.6
CPU - Intel i5
Problem:
The training process runs fine. To know when the model converges and when I should stop training, I watch the per-step loss in the terminal where the training is running, and I also watch the Total Loss graph in Tensorboard by running the following command in another terminal:
$ tensorboard --logdir="training"
But even after training for 60k steps, the loss fluctuates between 2.1 and 1.2. If I stop the training and export the inference graph from the last checkpoint (saved in the training/ folder), it detects cars in some cases and gives false positives in others.
I also tried running eval.py like below,
python3 eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training/ --eval_dir=eval/
but it gives an error indicating that the GPU runs out of memory when this script runs alongside train.py.
So I stop the training to make sure the GPU is free and then run eval.py, but it creates only one eval point in the eval/ folder. Why?
Also, how do I tell from the Precision graphs in Tensorboard when the training should be stopped?
I could also post screenshots if anyone wants.
Should I keep training until the loss stays around 1 on average?
Thanks.
PS: Added the Total Loss graph below, up to 66k steps.
PS2: After 2 days of training (and still going), this is the total loss graph below.
Usually, one uses a separate set of data to measure the error and generalisation abilities of the model. So, one would have the following sets of data to train and evaluate a model:
Training set: The data used to train the model.
Validation set: A separate set of data which will be used to measure the error during training. The data of this set is not used to perform any weight updates.
Test set: This set is used to measure the model's performance after the training.
In your case, you would have to define a separate set of data, the validation set, run an evaluation repeatedly after a fixed number of batches/steps, and log the error or accuracy. What usually happens is that the error on that data decreases in the beginning and starts to increase at a certain point during training. So it's important to keep track of that error and to generate a checkpoint whenever this error decreases. The checkpoint with the lowest error on your validation data is the one you want to use. This technique is called Early Stopping.
The reason why the error increases after a certain point during training is called Overfitting. It tells you that the model loses its ability to generalize to unseen data.
Edit:
Here's an example of a training loop with an early stopping procedure:
for step in range(1, _MAX_ITER):
    if step % _TEST_ITER == 0:
        # Run one full pass over the validation set and accumulate the loss.
        sample_count = 0
        while True:
            try:
                test_data = sess.run(test_batch)
                test_loss, summary = self._model.loss(sess, test_data[0],
                                                      self._assign_target(test_data),
                                                      self._merged_summary)
                sess.run(self._increment_loss_opt,
                         feed_dict={self._current_loss_pl: test_loss})
                sample_count += 1
            except tf.errors.OutOfRangeError:
                # Validation set exhausted: compare the average loss to the best so far.
                score = sess.run(self._avg_batch_loss,
                                 feed_dict={self._batch_count_pl: sample_count})
                best_score = sess.run(self._best_loss)
                if score < best_score:
                    '''
                    Save your model here...
                    '''
                break