Caffe loading LMDB batches very slowly - Linux

I generated an LMDB database using the SSD-Caffe fork here. I have successfully generated the VOC trainval/test LMDB directories and am able to train the model.
However, during training it takes inordinately long to load data from the LMDB database. For example, when profiling with Caffe's time function using this command:
ssdcaffe time --model "jobs/VGGNet/VOC0712/SSD_300x300/train.prototxt" --gpu 0 --iterations 20
I get that the forward pass takes on average 8.9s and the backward pass takes on average 0.5s. On a layer-by-layer inspection, the data ingestion layer takes the bulk of that time at 8.7s. See below:
I1129 10:14:11.094445 8011 caffe.cpp:404] data forward: 8660.38 ms.
...
I1129 10:14:11.095383 8011 caffe.cpp:412] Average Forward pass: 8933.31 ms.
I1129 10:14:11.095389 8011 caffe.cpp:414] Average Backward pass: 519.549 ms.
If I halve the batch size from 32 to 16, the data ingestion layer time roughly halves as well:
I1129 10:20:07.975527 8093 caffe.cpp:404] data forward: 3906.53 ms.
This is clearly not the intended speed, and something is wrong. Any help would be greatly appreciated!

Found my issue:
My images were too big. The standard VOC images the repo used were ~350x500 pixels, whereas my images were 1080x1920. When I downsized my images by 3x per side (i.e. 9x fewer pixels), my data ingestion layer took only 181ms, a 48x speedup over the previous 8.6s.
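For anyone hitting the same issue, here is a minimal sketch (using OpenCV; the paths are hypothetical) of downsizing the images before regenerating the LMDB. For a detection dataset, remember that the bounding-box annotations must be scaled by the same factor before the LMDB is rebuilt.

import glob
import os
import cv2

# Hypothetical paths; point these at your own dataset before rebuilding the LMDB.
src_dir = "data/JPEGImages_full"
dst_dir = "data/JPEGImages"
os.makedirs(dst_dir, exist_ok=True)

for path in glob.glob(os.path.join(src_dir, "*.jpg")):
    img = cv2.imread(path)
    # 1920x1080 -> 640x360: 3x smaller per side, i.e. 9x fewer pixels.
    # Scale the corresponding bounding-box annotations by the same factor.
    small = cv2.resize(img, None, fx=1 / 3, fy=1 / 3, interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), small)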

Related

Keras training with early stopping: how does it work when data is distributed?

In Keras (TF 2.4.1) I'm training a model on Google AI Platform. The job runs on a cluster with 1 master and 1 worker. Each machine is of type complex_model_m_gpu, which includes four NVIDIA Tesla K80 GPUs. My job is configured to stop early based on a metric that I calculate at each epoch (recall#k). When I look at the logs after training finishes, I can see that my metric is calculated twice at each epoch, and the subsequent tests to determine whether the metric has improved run on "parallel tracks", each track not knowing about the other. For example, at epoch 1 I get two numbers: 0.13306 and 0.12903. Later, at epoch 3, I get 0.17 and 0.11; 0.17 is compared to 0.13306 and 0.11 to 0.12903 (see image below, read from bottom to top).
Why two numbers? It's as if the master and the worker are each calculating the metric separately. Is there a way to get only the global measure and to determine the improvement only on this global number?
By the way, when I look at my scalar graphs in TensorBoard, my graphs are jumbled. Is it because I get multiple numbers at each epoch on a machine with multiple devices?
EDIT: I tried the same on a single machine (1 master, no worker) and this time I see only one number and my TensorBoard graphs are no longer jumbled. I've just realized that a master-and-worker configuration probably needs something different in my code (a tf.distribute.MultiWorkerMirroredStrategy instead of a MirroredStrategy). I have to investigate that. Ref: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
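The multi-worker setup I'd need to try probably looks roughly like the sketch below (the model here is a dummy placeholder; AI Platform supplies TF_CONFIG on each machine, and older TF releases expose the strategy under tf.distribute.experimental):

import tensorflow as tf

# AI Platform sets TF_CONFIG on each machine; the strategy reads it to discover
# the cluster, so the same script runs unchanged on the master and the worker.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Dummy model; replace with the real model and metric.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])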

Is there a PyTorch with CUDA Unified GPU-CPU Memory fork?

Training a DNN model can be a pain when a batch of one image takes 15 GB. Speed is not so important to me, but fitting bigger batches (and models) is. So I wonder if there is a PyTorch fork with CUDA Unified Memory, or something like that, to fit giant models (with 16 GB of RAM per GPU but 250 GB on the CPU side, it seems quite reasonable)?
If you do not care about the time it takes but need large batches, you can use a slower approach. Say you need a batch of 128 samples but your GPU memory fits only 8 samples. You can create smaller batches of 8 samples and then average their gradients.
For each small batch of 8 samples that you evaluate, you keep the .grad of each parameter in your CPU memory. You keep a list of grads for each of your model's parameters. After you have gathered the grads for 16 batches of 8 samples (128 samples in total), you can average the gradients of each parameter and put the result back into the .grad attribute of each parameter.
You can then call the .step() of your optimizer. This should yield exactly the same results as if you were using a large batch of 128 samples.
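Here is a minimal sketch of that accumulation scheme; the model, optimizer and random micro-batches are dummy placeholders for your own training loop:

import torch
import torch.nn as nn

# Dummy stand-ins; swap in your real model, data loader and optimizer.
model = nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 16  # 16 micro-batches of 8 samples = effective batch of 128
cpu_grads = [torch.zeros_like(p, device="cpu") for p in model.parameters()]

for step in range(accum_steps):
    x = torch.randn(8, 10).cuda()           # micro-batch of 8 samples (dummy data)
    y = torch.randint(0, 2, (8,)).cuda()
    loss = criterion(model(x), y)
    loss.backward()
    # Stash this micro-batch's gradients on the CPU, then free them on the GPU.
    for buf, p in zip(cpu_grads, model.parameters()):
        buf += p.grad.detach().cpu()
        p.grad = None

# Average the accumulated gradients, copy them back, and take one optimizer step.
for buf, p in zip(cpu_grads, model.parameters()):
    p.grad = (buf / accum_steps).to(p.device)
optimizer.step()
optimizer.zero_grad()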

What is the most efficient way to load data into Tensorflow for real time inference?

When setting up a data input pipeline to Tensorflow (web cam images), a large amount of time is spent loading the data from the system RAM to the GPU memory.
I am trying to feed a constant stream of images (1024x1024) through my object detection network. I'm currently using a V100 on AWS to perform inference.
My first attempt was with a simple feed_dict operation.
# Get layers
img_input_tensor = sess.graph.get_tensor_by_name('import/input_image:0')
img_anchors_input_tensor = sess.graph.get_tensor_by_name('import/input_anchors:0')
img_meta_input_tensor = sess.graph.get_tensor_by_name('import/input_image_meta:0')
detections_input_tensor = sess.graph.get_tensor_by_name('import/output_detections:0')
detections = sess.run(detections_input_tensor,
                      feed_dict={img_input_tensor: molded_image, img_meta_input_tensor: image_meta, img_anchors_input_tensor: image_anchor})
This produced inference times around 0.06 ms per image.
However, after reading the Tensorflow manual I noticed that the tf.data API was recommended for loading data for inference.
# setup data input
data = tf.data.Dataset.from_tensors((img_input_tensor, img_meta_input_tensor, img_anchors_input_tensor, detections_input_tensor))
iterator = data.make_initializable_iterator() # create the iterator
next_batch = iterator.get_next()
# load data
sess.run(iterator.initializer,
         feed_dict={img_input_tensor: molded_image, img_meta_input_tensor: image_meta, img_anchors_input_tensor: image_anchor})
# inference
detections = sess.run([next_batch])[0][3]
This sped up inference time to 0.01 ms, but loading the data took 0.1 ms. This makes the iterator method significantly slower overall than the 'slower' feed_dict method. Is there something I can do to speed up the loading process?
Here is a great guide on data pipeline optimization. I personally find the .prefetch method the easiest way to boost your input pipeline, but the article also covers more advanced techniques.
However, if your input data is not in TFRecords but is fed by your own code, you have to implement the described techniques (buffering, interleaved operations) yourself.
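As a minimal TF 1.x-style sketch, this is roughly what chaining .prefetch onto a generator-backed dataset looks like; the frame source is a dummy placeholder for the web-cam stream:

import numpy as np
import tensorflow as tf

# Dummy stand-in for the web-cam stream; each yield is one 1024x1024 frame.
def frame_stream():
    while True:
        yield np.zeros((1024, 1024, 3), dtype=np.float32)

dataset = (tf.data.Dataset.from_generator(frame_stream,
                                          output_types=tf.float32,
                                          output_shapes=(1024, 1024, 3))
           .batch(1)
           .prefetch(1))  # prepare the next frame while the GPU runs the current one

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()  # wire this tensor into the detection graph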

What's the difference between "samples_per_epoch" and "steps_per_epoch" in fit_generator

I was confused by this problem for several days...
My question is why the training time differs so massively when I set the batch_size to 1 versus 20 for my generator.
If I set the batch_size to be 1, the training time of 1 epoch is approximately 180 ~ 200 sec.
If I set the batch_size to be 20, the training time of 1 epoch is approximately 3000 ~ 3200 sec.
However, this huge difference between the training times seems abnormal, since I would expect the reverse:
batch_size = 1, training time -> 3000 ~ 3200 sec.
batch_size = 20, training time -> 180 ~ 200 sec.
The input to my generator is not file paths but NumPy arrays that are already loaded into memory via np.load(), so I don't think this is an I/O trade-off issue.
I'm using Keras-2.0.3 and my backend is tensorflow-gpu 1.0.1
I have seen the update in this merged PR, but that change doesn't seem to affect anything here (the usage is the same as the original).
The link here is a gist of my self-defined generator and the relevant part of my fit_generator call.
When you use fit_generator, the number of samples processed in each epoch is batch_size * steps_per_epoch. From the Keras documentation for fit_generator: https://keras.io/models/sequential/
steps_per_epoch: Total number of steps (batches of samples) to yield from generator before declaring one epoch finished and starting the next epoch. It should typically be equal to the number of unique samples of your dataset divided by the batch size.
This is different from the behaviour of 'fit', where increasing batch_size typically speeds up things.
In conclusion, when you increase batch_size with fit_generator, you should decrease steps_per_epoch by the same factor if you want the training time to stay the same or decrease.
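As an illustration, here is a minimal sketch with a dummy model and generator; the point is only that batch_size * steps_per_epoch stays equal to the dataset size:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Dummy data and model, just to illustrate the bookkeeping.
n_samples = 4000
x, y = np.random.rand(n_samples, 8), np.random.rand(n_samples, 1)

def my_generator(batch_size):
    while True:
        idx = np.random.randint(0, n_samples, batch_size)
        yield x[idx], y[idx]

model = Sequential([Dense(1, input_dim=8)])
model.compile(optimizer="sgd", loss="mse")

batch_size = 20
# Keep batch_size * steps_per_epoch equal to the dataset size, so one epoch still
# covers ~4000 samples; with batch_size=1 you would use 4000 steps instead of 200.
model.fit_generator(my_generator(batch_size),
                    steps_per_epoch=n_samples // batch_size,
                    epochs=2)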
Let's make this clear:
Assume you have a dataset with 8000 samples (rows of data) and you choose batch_size = 32 and epochs = 25.
This means the dataset will be divided into 8000/32 = 250 batches, with 32 samples/rows in each batch. The model weights are updated after each batch.
One epoch therefore trains on 250 batches, i.e. 250 updates to the model.
Here, steps_per_epoch = number of batches.
With 25 epochs, the model will pass through the whole dataset 25 times.
Ref - https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
You should also take into account the following function parameters when working with fit_generator:
max_queue_size, use_multiprocessing and workers
max_queue_size - might cause Keras to load more data than you actually expect, which, depending on your generator code, may do something unexpected or unnecessary and slow down your execution times.
use_multiprocessing together with workers - might spin up additional processes, which leads to extra work for serialization and interprocess communication. First your data gets serialized with pickle, then it is sent to the target processes, then the processing happens inside those processes, and then the whole communication procedure repeats backwards: the results are pickled and sent back to the main process via RPC. In most cases this should be fast, but if you're processing dozens of gigabytes of data, or your generator is implemented in a sub-optimal fashion, you might get the slowdown you describe.
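For illustration only, and reusing the dummy model and generator from the sketch above, these knobs are passed directly to fit_generator (the values here are arbitrary):

model.fit_generator(my_generator(batch_size),
                    steps_per_epoch=n_samples // batch_size,
                    epochs=2,
                    max_queue_size=10,         # how many batches may be pre-generated
                    workers=4,                 # processes/threads running the generator
                    use_multiprocessing=True)  # batches are pickled between processes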
The bottom line is:
fit() works faster than fit_generator() since it can access the data directly in memory.
fit() takes NumPy arrays that are already in memory, while fit_generator() takes data from a sequence generator such as keras.utils.Sequence, which is slower.
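For completeness, here is a minimal hypothetical keras.utils.Sequence serving batches of in-memory arrays, which is the kind of source fit_generator pulls from instead of a plain NumPy array:

import numpy as np
from keras.utils import Sequence

class ArrayBatches(Sequence):
    """Serves fixed-size batches from arrays that are already in memory."""
    def __init__(self, x, y, batch_size):
        self.x, self.y, self.batch_size = x, y, batch_size

    def __len__(self):
        # Number of batches per epoch (this plays the role of steps_per_epoch).
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        s = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return self.x[s], self.y[s]

# model.fit_generator(ArrayBatches(x, y, 20), epochs=2)  # generator path
# model.fit(x, y, batch_size=20, epochs=2)               # in-memory path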

Finding the duration of testing data in Sphinx training

I am training a model via PocketSphinx and SphinxTrain. The duration of the training data is shown in the log file; for example, my current training data is reported as:
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 1.00766111111111
After training, testing is done. For testing I have added 20 files, but I don't know the total length of these files. Finding it manually is hard, and will get harder as I add more testing data.
So is there any log file, or any other (non-manual) way, to check the duration of my testing data?
I just found it; I am posting my own answer so it may help others.
You can find it under logdir/decode/dbname-1-1.log, where dbname is your main folder name; in my case, logdir/decode/tester-1-1.log.
Open this file and there will be a line like:
INFO: batch.c(778): TOTAL 81.24 seconds speech, 30.43 seconds CPU, 37.54 seconds wall
Here, "TOTAL 81.24 seconds speech" is the duration of my testing audio data.
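If you want to grab that number automatically, a small sketch like the following works; the path follows the example above, so replace tester with your own dbname:

import re

log_path = "logdir/decode/tester-1-1.log"  # hypothetical path, see above
with open(log_path) as f:
    for line in f:
        match = re.search(r"TOTAL\s+([\d.]+) seconds speech", line)
        if match:
            print("Testing audio duration:", match.group(1), "seconds")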
