I use AWS SageMaker to run training with AllenNLP. To track the loss and metrics, I need them written to the log at INFO level during training (or at least after each epoch). However, when I run the training, all loss and metric information is printed to the console without going through the logger:
2021-07-12 16:39:25,799 - INFO - allennlp.training.trainer - Epoch 2/24
2021-07-12 16:39:25,803 - INFO - allennlp.training.trainer - Worker 0 memory usage: 1.5G
2021-07-12 16:39:25,806 - INFO - allennlp.training.trainer - Training
accuracy: 0.1116, batch_loss: 0.4598, loss: 0.4742 ||: 100%|##########| 8/8 [00:13<00:00, 1.64s/it]
2021-07-12 16:39:40,229 - INFO - allennlp.training.trainer - Validating
accuracy: 0.2000, batch_loss: 0.4377, loss: 0.4215 ||: 100%|##########| 2/2 [00:03<00:00, 1.87s/it]
So far, I could not find anything in the issues or on StackOverflow. As I said, having the loss and metrics logged on INFO level once per epoch would be totally fine.
Also in this example it seems like the loss and metrics are logged the way I would like to have it.
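To make the goal concrete, here is a minimal sketch of what "once per epoch at INFO level" could look like using only the standard logging module. The logger name and both helper functions are hypothetical. How such a helper gets called at the end of each epoch depends on the callback mechanism of the AllenNLP version in use, which is not shown here.

```python
import logging

# Hypothetical logger name, chosen only for this illustration.
logger = logging.getLogger("training.metrics")


def format_metrics(metrics):
    """Render a metrics dict as 'name: value' pairs in a stable (sorted) order."""
    return ", ".join(f"{name}: {value:.4f}" for name, value in sorted(metrics.items()))


def log_epoch_metrics(epoch, num_epochs, phase, metrics):
    """Emit a single INFO record holding all metrics for one phase of an epoch."""
    logger.info("Epoch %d/%d - %s - %s", epoch, num_epochs, phase, format_metrics(metrics))


# Same record layout as the trainer log lines shown in the question.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
)
log_epoch_metrics(2, 24, "Training", {"accuracy": 0.1116, "loss": 0.4742})
```

Because the record goes through a regular logger, it ends up wherever the other INFO lines go, unlike the progress-bar line.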
In Keras (TF 2.4.1) I'm training a model on Google AI Platform. The job runs on a cluster with 1 master and 1 worker. Each machine type is complex_model_m_gpu that includes four NVIDIA Tesla K80 GPUs. My job is configured to stop early based on a metric that I calculate at each epoch (recall#k). When I look at the logs after training finishes I can see that my metric is calculated two times at each epoch and that subsequent tests to determine if metric has improved or not are made on "parallel tracks", each track not knowing the other. For example at epoch 1 I get two numbers: 0.13306 and 0.12903. Later at epoch 3, I get 0.17 and 0.11; 0.17 is compared to 0.13306 and 0.11 to 0.12903 (see image below, read from bottom to top)
Why two numbers? It's as if the master and the worker each calculate the metric separately. Is there a way to get only the global measure and to determine the improvement based only on this global number?
By the way when I look at my scalar graphs in Tensorboard, my graphs are jumbled. Is it because I get multiple numbers at each epoch on a machine with multiple devices?
EDIT: I tried the same on a single machine (1 master, no worker), and this time I see only one number and my TensorBoard graphs are no longer jumbled. I've just realized that a master-and-worker configuration probably needs something different in my code (a tf.distribute.MultiWorkerMirroredStrategy instead of a MirroredStrategy). I have to investigate that. Ref: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
Reading https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel & https://discuss.pytorch.org/t/how-does-enumerate-trainloader-0-work/14410 I'm trying to understand how training epochs behave in PyTorch.
Take this outer and inner loop:
for epoch in range(num_epochs):
    for i1, i2 in enumerate(training_loader):
Is this interpretation correct?
For each pass of the outer loop (one epoch), the entire training set, training_loader in the example above, is iterated batch by batch. This means the model does not process one instance per training cycle. Per training cycle (for epoch in range(num_epochs):), the entire training set is processed in chunks/batches, where the batch size is determined when creating training_loader.
torch.utils.data.DataLoader returns an iterable that iterates over the dataset.
Therefore, the following -
training_loader = torch.utils.data.DataLoader(*args)
for i1, i2 in enumerate(training_loader):
    # process
runs over the dataset once, completely, in batches.
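The epoch/batch relationship can be illustrated without PyTorch. The sketch below uses a hypothetical iterate_batches helper as a plain-Python stand-in for a data loader (it is not the torch.utils.data.DataLoader API) and counts how many batches each epoch produces.

```python
def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of the dataset; the last one may be smaller."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


dataset = list(range(10))  # ten toy samples
num_epochs = 2
batch_size = 4

batches_per_epoch = []
for epoch in range(num_epochs):
    seen = []
    count = 0
    # Inner loop: one full pass over the dataset, batch by batch.
    for i, batch in enumerate(iterate_batches(dataset, batch_size)):
        seen.extend(batch)
        count += 1
    batches_per_epoch.append(count)
    # Every sample was visited exactly once in this epoch.
    assert sorted(seen) == dataset

print(batches_per_epoch)
```

With ten samples and a batch size of four, each epoch yields three batches (sizes 4, 4 and 2), and every epoch covers the whole dataset again.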
I am using Tensorflow's Object Detection API to detect cars. It should detect the cars as one class "car".
I followed this series by sentdex:
https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
System information:
OS - Ubuntu 18.04 LTS
GPU - Nvidia 940M (VRAM : 2GB)
Tensorflow : 1.10
Python - 3.6
CPU - Intel i5
Problem:
The training process runs fine. To know when the model has converged and when I should stop training, I watch the per-step loss in the terminal where the training is running, and I also watch the Total Loss graph in TensorBoard by running the following command in another terminal:
$ tensorboard --logdir="training"
But even after training for 60k steps, the loss fluctuates between 2.1 and 1.2. If I stop the training and export the inference graph from the last checkpoint (saved in the training/ folder), it detects cars in some cases and gives false positives in others.
I also tried running eval.py like below,
python3 eval.py --logtostderr --pipeline_config_path=training/ssd_mobilenet_v1_pets.config --checkpoint_dir=training/ --eval_dir=eval/
but it fails with an error indicating that there is not enough GPU memory to run this script alongside train.py.
So I stop the training to make sure the GPU is free and then run eval.py, but it creates only one eval point in the eval/ folder. Why?
Also, how do I understand from the Precision graphs in Tensorboard that the training needs to be stopped?
I could also post screenshots if anyone wants.
Should I keep training until the loss averages around 1?
Thanks.
PS: Added Total Loss graph below till 66k steps.
PS2: After 2 days of training (and still running), this is the total loss graph below.
Usually, one uses a separate set of data to measure the error and generalisation abilities of the model. So, one would have the following sets of data to train and evaluate a model:
Training set: The data used to train the model.
Validation set: A separate set of data which will be used to measure the error during training. The data of this set is not used to perform any weight updates.
Test set: This set is used to measure the model's performance after the training.
In your case, you would have to define a separate set of data, the validation set, run an evaluation on it repeatedly after a fixed number of batches/steps, and log the error or accuracy. What usually happens is that the error on that data decreases in the beginning and starts to increase at a certain point during training. So it's important to keep track of that error and to save a checkpoint whenever it decreases. The checkpoint with the lowest error on your validation data is the one you want to use. This technique is called Early Stopping.
The reason the error increases after a certain point during training is Overfitting: the model loses its ability to generalize to unseen data.
Edit:
Here's an example of a training loop with early stopping procedure:
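import tensorflow as tf

# Fragment from inside a trainer method: sess, test_batch and the self._* ops are defined elsewhere.
for step in range(1, _MAX_ITER):
    if step % _TEST_ITER == 0:
        # Run one full pass over the validation data.
        sample_count = 0
        while True:
            try:
                test_data = sess.run(test_batch)
                test_loss, summary = self._model.loss(sess, test_data[0], self._assign_target(test_data), self._merged_summary)
                # Accumulate the loss of this batch.
                sess.run(self._increment_loss_opt, feed_dict={self._current_loss_pl: test_loss})
                sample_count += 1
            except tf.errors.OutOfRangeError:
                # Validation data exhausted: average the loss and compare with the best so far.
                score = sess.run(self._avg_batch_loss, feed_dict={self._batch_count_pl: sample_count})
                best_score = sess.run(self._best_loss)
                if score < best_score:
                    # Save your model here...
                    pass
                break  # leave the evaluation loop and resume training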
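The early-stopping idea described in this answer can also be sketched independently of TensorFlow. In the minimal example below, the EarlyStopping class, its patience parameter, and the validation-loss values are all assumptions made for illustration. The "save a checkpoint" step is only marked by a comment.

```python
class EarlyStopping:
    """Track the best validation loss and signal when it stops improving."""

    def __init__(self, patience=3):
        self.patience = patience      # evaluations without improvement to tolerate
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_step = step
            self.bad_evals = 0
            # A checkpoint would be saved here, since this is the best model so far.
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Toy validation losses: they fall at first, then rise as overfitting sets in.
val_losses = [2.1, 1.6, 1.3, 1.2, 1.25, 1.4, 1.5, 1.7]

stopper = EarlyStopping(patience=3)
stopped_at = None
for step, loss in enumerate(val_losses, start=1):
    if stopper.update(step, loss):
        stopped_at = step
        break

print(stopped_at, stopper.best_step, stopper.best_loss)
```

The checkpoint to keep is the one from best_step, not the one from the step at which training stopped.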
I am using cloudwatch-exporter to scrape metrics from CloudWatch and expose them in its localhost:9106/metrics.
The configuration for this is the following:
region: us-east-1
set_timestamp: false
metrics:
- aws_namespace: AWS/CloudFront
  aws_metric_name: TotalErrorRate
  aws_statistics: [Average]
  aws_dimensions: [DistributionId, Region]
  aws_dimensions_select:
    Region: [Global]
And I can indeed see the fetched metrics:
$> curl localhost:9106/metrics
# HELP aws_cloudfront_total_error_rate_average CloudWatch metric AWS/CloudFront TotalErrorRate Dimensions: [DistributionId, Region] Statistic: Average Unit: Percent
# TYPE aws_cloudfront_total_error_rate_average gauge
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E1XXXXXX",} 26.666666666666668
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="EXXXXXXX",} 0.0
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E38XXXXXX",} 0.0
aws_cloudfront_total_error_rate_average{job="aws_cloudfront",instance="",region="Global",distribution_id="E6XXXXXXX",} 100.0
# HELP cloudwatch_exporter_scrape_duration_seconds Time this CloudWatch scrape took, in seconds.
# TYPE cloudwatch_exporter_scrape_duration_seconds gauge
cloudwatch_exporter_scrape_duration_seconds 14.487444391
# HELP cloudwatch_exporter_scrape_error Non-zero if this scrape failed.
# TYPE cloudwatch_exporter_scrape_error gauge
cloudwatch_exporter_scrape_error 0.0
However, Prometheus does not scrape them, and outputs the following logs:
level=warn ts=2018-06-20T07:00:37.578384931Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
level=warn ts=2018-06-20T07:01:36.821700134Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
level=warn ts=2018-06-20T07:02:35.593731873Z caller=scrape.go:932 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://100.106.248.21:9106/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=24
Specifically:
msg="Error on ingesting samples that are too old or are too far into the future"
My guess is that, with CloudWatch located in Virginia and our cloudwatch_exporter and Prometheus in the EU, a timestamp difference is preventing Prometheus from ingesting these metrics.
Hence my attempt to use set_timestamp: false, as pointed out in this Merge Request.
However, that does not work.
I am not a Prometheus expert, and maybe it is wrongly configured. How can I investigate further?
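One way to investigate is to check whether the samples served by the exporter carry explicit timestamps. In the Prometheus text exposition format, an optional timestamp in milliseconds may follow the sample value. The sketch below flags samples whose timestamp is far from the current time. The helper names and the 600-second threshold are assumptions for illustration, not values taken from Prometheus.

```python
import time


def sample_timestamp_ms(line):
    """Return the explicit timestamp (ms) of an exposition line, or None if absent."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE comment
    # Whatever follows the label set (or the bare metric name) is "value [timestamp]".
    if "}" in line:
        rest = line[line.rindex("}") + 1:].split()
    else:
        rest = line.split()[1:]
    return int(rest[1]) if len(rest) > 1 else None


def stale_samples(text, max_age_s=600, now_ms=None):
    """List sample lines whose timestamp differs from 'now' by more than max_age_s."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    stale = []
    for line in text.splitlines():
        ts = sample_timestamp_ms(line)
        if ts is not None and abs(now_ms - ts) > max_age_s * 1000:
            stale.append(line)
    return stale
```

Feeding it the body returned by curl localhost:9106/metrics shows which samples, if any, still carry a timestamp. The sample lines pasted in the question end with the value only, so this check would report nothing for them.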
I was confused by this problem for several days...
My question is why the training time differs so massively between setting the batch_size of my generator to 1 and setting it to 20.
If I set the batch_size to 1, the training time for 1 epoch is approximately 180 ~ 200 sec.
If I set the batch_size to 20, the training time for 1 epoch is approximately 3000 ~ 3200 sec.
This huge difference seems abnormal, since I would expect the reverse:
batch_size = 1, training time -> 3000 ~ 3200 sec.
batch_size = 20, training time -> 180 ~ 200 sec.
The input to my generator is not a file path but NumPy arrays that are already loaded into memory by calling "np.load()".
So I don't think disk I/O is the bottleneck.
I'm using Keras-2.0.3 and my backend is tensorflow-gpu 1.0.1
I have seen the update from this merged PR, but it seems that the change doesn't affect anything (the usage is the same as the original one).
The link here is the gist of my self-defined generator and the part of my fit_generator.
When you use fit_generator, the number of samples processed in each epoch is batch_size * steps_per_epoch. From the Keras documentation for fit_generator: https://keras.io/models/sequential/
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
This is different from the behaviour of fit, where increasing batch_size typically speeds things up.
In conclusion, when you increase batch_size with fit_generator, you should decrease steps_per_epoch by the same factor if you want the training time to stay the same or go down.
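The arithmetic behind this can be written out as a small sketch. The dataset size of 1000 samples and the helper names are assumptions for illustration. Only the relationship samples per epoch = batch_size * steps_per_epoch comes from the answer above.

```python
import math


def steps_for_one_pass(num_samples, batch_size):
    """steps_per_epoch needed to cover the dataset exactly once."""
    return math.ceil(num_samples / batch_size)


def samples_per_epoch(batch_size, steps_per_epoch):
    """Samples drawn from the generator in one epoch."""
    return batch_size * steps_per_epoch


num_samples = 1000  # hypothetical dataset size

# steps_per_epoch left at the value that was right for batch_size = 1:
fixed_steps = steps_for_one_pass(num_samples, 1)
print(samples_per_epoch(1, fixed_steps))    # one pass over the data
print(samples_per_epoch(20, fixed_steps))   # twenty passes' worth of samples

# steps_per_epoch rescaled together with batch_size:
print(steps_for_one_pass(num_samples, 20))
```

If steps_per_epoch stays fixed while batch_size grows from 1 to 20, each epoch processes 20 times as many samples, which may explain an epoch taking many times longer.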
Let's clarify:
Assume you have a dataset with 8000 samples (rows of data) and you choose batch_size = 32 and epochs = 25.
This means that the dataset will be divided into 8000/32 = 250 batches, with 32 samples/rows in each batch. The model weights will be updated after each batch.
One epoch will train on 250 batches, i.e. 250 updates to the model.
Here steps_per_epoch = number of batches.
With 25 epochs, the model will pass through the whole dataset 25 times.
Ref - https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
You should also take into account the following function parameters when working with fit_generator:
max_queue_size, use_multiprocessing and workers
max_queue_size - may cause more data to be loaded than you expect, which, depending on your generator code, may trigger unexpected or unnecessary work that slows down execution.
use_multiprocessing together with workers - may spin up additional processes, which adds work for serialization and inter-process communication. First your data is serialized using pickle, then it is sent to the target processes, then the processing happens inside those processes, and then the whole communication procedure repeats in reverse: the results are pickled and sent back to the main process. In most cases this is fast, but if you are processing dozens of gigabytes of data or your generator is implemented in a sub-optimal fashion, you might get the slowdown you describe.
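To get a rough feel for the serialization overhead mentioned above, one can time a pickle round trip of a single batch. The sketch below uses only the standard library. The pickle_round_trip helper and the toy batch are hypothetical, and the numbers it prints say nothing about Keras internals.

```python
import pickle
import time


def pickle_round_trip(obj):
    """Serialize and deserialize obj; return the copy, payload size and elapsed time."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return restored, len(payload), elapsed


# Toy "batch": 200 rows of 100 floats each.
batch = [[float(i + j) for j in range(100)] for i in range(200)]

restored, size_bytes, seconds = pickle_round_trip(batch)
print(f"{size_bytes} bytes, {seconds:.6f} s per round trip")
```

Multiplying the per-batch time by the number of batches per epoch gives an order-of-magnitude estimate of how much time goes into moving data between processes.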
In summary:
fit() works faster than fit_generator() since it can access the data directly in memory.
fit() takes NumPy arrays that are already in memory, while fit_generator() pulls data from a generator or a keras.utils.Sequence, which is slower.