Tensorflow: Input Pipeline very slow / does not scale - multithreading

I'm trying to set up a TensorFlow input pipeline for feeding images into an AlexNet for feature extraction (not for training; this is a one-off thing). Since AlexNet is rather small, it is crucial to provide input data at a high rate to achieve acceptable performance (~1000 images/second).
My images are 400x300 JPEGs, about 24 KB per image on average.
Unfortunately, it seems that the TensorFlow input pipeline can't keep up with a GTX 1080 running AlexNet.
My input pipeline is simple: load a file, decode the JPEG, resize the image, and batch the results.
I created a small benchmark to show the issue:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import time
import glob
import os

IMAGE_DIR = 'images'
EPOCHS = 1


def main():
    print('batch_size\tnum_threads\tms/image')
    for batch_size in [16, 32, 64, 128]:
        for num_threads in [1, 2, 4, 8]:
            run(batch_size, num_threads)


def run(batch_size, num_threads):
    filenames = glob.glob(os.path.join(IMAGE_DIR, '*.jpg'))
    (filename,) = tf.train.slice_input_producer(
        [filenames],
        capacity=2 * batch_size * num_threads,
        num_epochs=EPOCHS)
    raw = tf.read_file(filename)
    decoded = tf.image.decode_jpeg(raw, channels=3)
    resized = tf.image.resize_images(decoded, [227, 227])
    batch = tf.train.batch(
        [resized],
        batch_size,
        num_threads,
        2 * batch_size * num_threads,
        enqueue_many=True)

    init_op = tf.group(
        tf.global_variables_initializer(),
        tf.local_variables_initializer())

    with tf.Session() as sess:
        sess.run(init_op)
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)

        t = time.time()
        try:
            while not coord.should_stop():
                sess.run(batch)
        except tf.errors.OutOfRangeError:
            pass
        finally:
            coord.request_stop()

        tpe = (time.time() - t) / (len(filenames) * EPOCHS) * 1000
        print('{: <11}\t{: <10}\t{: <8}'
              .format(batch_size, num_threads, tpe))

        coord.join(threads)


if __name__ == "__main__":
    main()
Running this on a MacBook Pro (early 2015, 2.9 GHz Intel Core i5) yields the following results:
batch_size num_threads ms/image
16 1 4.81571793556
16 2 3.00584602356
16 4 2.94281005859
16 8 2.94555711746
32 1 3.51123785973
32 2 1.82255005836
32 4 1.85884213448
32 8 1.88741898537
64 1 2.9537730217
64 2 1.58108997345
64 4 1.57125210762
64 8 1.57615303993
128 1 2.71797513962
128 2 1.67120599747
128 4 1.6521999836
128 8 1.6885869503
Overall performance is bad and far from the target of ~1 ms per image. It also does not scale beyond two threads, which in this case is to be expected since the machine only has a dual-core processor.
Running the same benchmark on a 2.5 GHz AMD Opteron 6180 SE with 24 cores yields the following:
batch_size num_threads ms/image
16 1 13.983194828
16 2 6.80965399742
16 4 6.67097783089
16 8 6.63090395927
32 1 12.0395629406
32 2 5.72535085678
32 4 4.94155502319
32 8 4.99696803093
64 1 10.9073989391
64 2 4.96317911148
64 4 3.76832485199
64 8 3.82816386223
128 1 10.2617599964
128 2 5.20488095284
128 4 3.16122984886
128 8 3.51550602913
Here, too, single-threaded and overall performance is very bad, and it does not scale beyond 2-4 threads.
The systems are neither I/O nor CPU bound in any of these cases. On both systems, loading and resizing the images with OpenCV gives far better numbers (~0.86 ms/image on the MacBook, which in that case is CPU bound, and as low as ~0.22 ms/image on the server, which in that case is I/O bound).
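For reference, a comparable OpenCV loop would look roughly like this (a sketch of an equivalent single-threaded decode-and-resize loop, not the exact benchmark code behind those numbers):

import glob
import os
import time

import cv2

filenames = glob.glob(os.path.join('images', '*.jpg'))
t = time.time()
for fn in filenames:
    img = cv2.imread(fn)               # load + decode JPEG
    img = cv2.resize(img, (227, 227))  # resize to AlexNet input size
print('ms/image: {:.2f}'.format((time.time() - t) / len(filenames) * 1000))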
What's going on with Tensorflow here? How can I speed this up?
I already tried to assemble a batch of images manually and use enqueue_many for batching, but that made things even worse. I also tried adding a small sleep before running the loop, just to make sure the queues are filled, but no luck.
Any help is greatly appreciated.

Related

Create 2D numpy array from buffer

Consider a system with n_channels transmitting n_samples at a given sampling rate. The 1D buffer containing the timestamps and the 2D buffer containing the (n_channels, n_samples) data are created as:
from ctypes import c_double, c_float
# Assume a 2-second window, 3 channels, sampled at 1024 Hz
# data: (n_channels, n_samples) = (3, 2048)
# timestamps: (n_samples,) = (2048,)
n_channels = 3
n_samples = 2048
n_data_values = n_channels * n_samples
data_buffer = (c_float * n_data_values)()
ts_buffer = (c_double * n_samples)()
I have a C++ binary library that fills the buffer. The function can be summarized as:
from ctypes import byref
fill_buffers(
byref(data_buffer),
byref(ts_buffer),
)
At this point, I have 2 filled buffers, one with 2048 elements (timestamps) and one with 3 * 2048 elements (data). I want to load those 2 buffers into numpy arrays as efficiently as possible.
np.frombuffer is great for reading a 1D array, e.g. the timestamps, but I can't find a counterpart for N-dimensional arrays.
# read from buffer for the 1D array
timestamps = np.frombuffer(ts_buffer) # 192 ns ± 1.11 ns per loop
timestamps = np.array(ts_buffer) # 854 ns ± 2.99 ns per loop
For now, the data array is loaded with:
data = np.array(data_buffer).reshape(-1, n_channels, order="C").T
Any way to use the same efficient method as np.frombuffer while providing the output shape and the order?
This question is different from How can I initialize a NumPy array from a multidimensional buffer? and from How to restore a 2-dimensional numpy.array from a bytestring? since it does not focus on an alternative to np.frombuffer, but on an alternative that is as efficient.
EDIT: Why is np.frombuffer(data_buffer).reshape(-1, n_channels).T not working? With 3 channels and 1024 points (to speed up my testing), I get len(data_buffer) = 3072, but:
np.array(data_buffer).reshape(-1, 3).T.size = 3072
np.frombuffer(data_buffer).reshape(-1, 3).T.size = 1536
The application is a LabStreamingLayer buffer. The buffer is filled here https://github.com/labstreaminglayer/liblsl-Python/blob/87276974a311bcf7ceb3383e9d04c6bdcf302771/pylsl/pylsl.py#L854-L861
using the C++ library https://github.com/sccn/liblsl with specifically this function https://github.com/sccn/liblsl/blob/08aa186326e9a339316b7d5677ef31b3651b4aad/src/lsl_inlet_c.cpp#L180-L185
Does np.frombuffer(data_buffer, dtype=c_float).reshape(-1, n_channels, order="C").T not work correctly? As you are doing it now, np.array treats the buffer as a 1D array until you reshape it anyway.
For me the following code produces the right shapes. (It is hard to verify that it works correctly without an MWE for the data that should be in the buffers.)
import numpy as np
from ctypes import c_double, c_float
# Assume a 2-second window, 3 channels, sampled at 1024 Hz
# data: (n_channels, n_samples) = (3, 2048)
# timestamps: (n_samples,) = (2048,)
n_channels = 3
n_samples = 2048
n_data_values = n_channels * n_samples
data_buffer = (c_float * n_data_values)()  # Note that c_float is typically 32 bits while c_double and numpy's default float64 are 64 bits
ts_buffer = (c_double * n_samples)()
# Create a mock buffer
input_data = np.arange(0,n_data_values, dtype=c_float)
input_data_buffer = input_data.tobytes()
timestamps = np.frombuffer(ts_buffer)
# Note that you need to specify the data type for the array of floats
data = np.frombuffer(input_data_buffer, dtype=c_float).reshape(-1, n_channels, order="C").T
# data has values 0,1,2 for first time point, 3,4,5 for second, and so on
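Applied directly to the ctypes buffer from the question, the same pattern should avoid the extra copy that np.array makes (a sketch, assuming the sample-major layout used above):

# Hypothetical direct use on the ctypes buffer; np.frombuffer returns a view,
# so no data is copied before the reshape and transpose (both also views here).
data = np.frombuffer(data_buffer, dtype=c_float).reshape(-1, n_channels, order="C").T
# data.shape == (n_channels, n_samples)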

How does the queue in Pytorch DataLoader work with num_workers >= 2?

Do I understand the following correctly?
When num_workers >=1, the main process pre-loads prefetch_factor * num_workers batches. When the training loop consumes one batch, the corresponding worker loads the next batch in its queue.
If this is the case, let's go through an example.
NOTE: I have chosen the numeric values for illustration purposes and have ignored various overheads in this example. The accompanying code example uses these numbers.
Say I have num_workers=4, prefetch_factor=4, batch_size=128. Further assume, it takes 0.003125 s to fetch an item from a source database and the train step takes 0.05 s.
Now, each batch would take 0.003125 * 128 = 0.4 s to load.
With prefetch_factor=4 and num_workers=4, 4*4=16 batches will be loaded first.
Once the 16 batches are loaded, the first train step consumes 1 batch and takes 0.05 s. Say worker[0] provided this batch and will start the process to generate a new batch to replenish the queue. Recall fetching a new batch takes 0.4 s.
Similarly, the second step consumes one more batch and the corresponding worker (worker[1] in this example) starts the data fetching process.
The first 8 train steps would take 0.05*8 = 0.4 s. By this time, 8 batches have been consumed and worker[0] has produced 1 batch. In the next step, 1 batch is consumed and worker[1] produces a new batch; worker[1] had started fetching data in the second train step, and that fetch would now be completed.
Following this, each subsequent train step consumes 1 batch and one of the workers produces 1 batch, keeping the dataloader queue at 8 batches at all times. This means the train step never waits for the data loading process, as there are always 8 batches in the buffer.
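To keep the arithmetic straight, these are the numbers assumed in the walkthrough above (purely illustrative values, not measurements):

item_fetch_s = 0.003125    # time to fetch one item from the source database
batch_size = 128
train_step_s = 0.05        # time for one train step
num_workers = 4
prefetch_factor = 4

batch_load_s = item_fetch_s * batch_size             # 0.4 s for a worker to load one batch
initial_batches = num_workers * prefetch_factor      # 16 batches loaded up front
steps_per_batch_load = batch_load_s / train_step_s   # 8 train steps pass while one batch is loaded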
I would expect this behavior regardless of the batch's data size, given that num_workers and prefetch_factor are large enough. However, in the following code example that is not the case.
In the code below, I define a custom iterable that returns a numpy array. As the size of the numpy array increases, increasing num_workers or prefetch_factor does not improve the time taken to run through a batch.
I'm guessing this is because each worker serializes the batch to send to the main process, where it is deserialized. As the data size increases, this process takes more time. However, I would think that if the queue size (num_workers, prefetch_factor) is large enough, there should be a break-even point where each batch consumed by a training step is replenished by one of the workers, as illustrated in the example above.
In the code below, when MyIterable returns a small object (np array of size (10, 150)), increasing num_workers helps as expected. But when the returned object is larger (np array of size (1000, 150)), num_workers or prefetch_factor does not do much.
# small np object
avg time per batch for num workers=0: 0.47068126868714444
avg time per batch for num workers=2: 0.20982365206225495
avg time per batch for num workers=4: 0.10560789656221914
avg time per batch for num workers=6: 0.07202646931250456
avg time per batch for num workers=8: 0.05311137337469063
# large np object
avg time per batch for num workers=0: 0.6090951558124971
avg time per batch for num workers=2: 0.4594530961876444
avg time per batch for num workers=4: 0.45023533212543043
avg time per batch for num workers=6: 0.3830978863124983
avg time per batch for num workers=8: 0.3811495694375253
Am I missing something here? Why doesn't the data loader queue have a large enough buffer such that data loading is not the bottleneck?
Even if serialization and deserialization take longer in the latter case, I'd expect a buffer large enough that the consumption and replenishment rates of the batches are almost equal. Otherwise, what is the point of having prefetch_factor?
If the code is behaving as expected, are there any other ways to pre-load the next n batches in a buffer such that it is large enough and never depleted? (One possible workaround is sketched after the code below.)
Thanks
import time
import torch
import numpy as np
from time import sleep
from torch.utils.data import DataLoader, IterableDataset


def collate_fn(records):
    # some custom collation function
    return records


class MyIterable(object):
    def __init__(self, n):
        self.n = n
        self.i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            self.i += 1  # advance the counter so the iterator stops after n items
            sleep(0.003125)  # simulates data fetch time
            # return np.random.random((10, 150))  # small data item
            return np.random.random((1000, 150))  # large data item
        else:
            raise StopIteration


class MyIterableDataset(IterableDataset):
    def __init__(self, n):
        super(MyIterableDataset).__init__()
        self.n = n

    def __iter__(self):
        return MyIterable(self.n)


def get_performance_metrics(num_workers):
    ds = MyIterableDataset(n=10000)
    if num_workers == 0:
        dl = torch.utils.data.DataLoader(ds, num_workers=0, batch_size=128, collate_fn=collate_fn)
    else:
        dl = torch.utils.data.DataLoader(ds, num_workers=num_workers, prefetch_factor=4, persistent_workers=True,
                                         batch_size=128, collate_fn=collate_fn,
                                         multiprocessing_context='spawn')
    warmup = 5
    times = []
    t0 = time.perf_counter()
    for i, batch in enumerate(dl):
        sleep(0.05)  # simulates train step
        e = time.perf_counter()
        if i >= warmup:
            times.append(e - t0)
        t0 = time.perf_counter()
        if i >= 20:
            break
    print(f'avg time per batch for num workers={num_workers}: {sum(times) / len(times)}')


if __name__ == '__main__':
    num_worker_options = [0, 2, 4, 6, 8]
    for n in num_worker_options:
        get_performance_metrics(n)
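One possible workaround (a sketch, not from the original post): keep an extra buffer of already-collated batches in the main process, filled by a background thread that consumes the DataLoader. This only helps if the main process has idle time during the train step in which the thread can drain the workers' result queue, which is an assumption here.

import queue
import threading


class BackgroundPrefetcher:
    """Wraps an iterable (e.g. a DataLoader) and buffers up to buffer_size items."""

    def __init__(self, loader, buffer_size=32):
        self.queue = queue.Queue(maxsize=buffer_size)
        self.thread = threading.Thread(target=self._fill, args=(loader,), daemon=True)
        self.thread.start()

    def _fill(self, loader):
        for batch in loader:
            self.queue.put(batch)  # blocks while the buffer is full
        self.queue.put(None)       # sentinel: end of the epoch

    def __iter__(self):
        while True:
            batch = self.queue.get()
            if batch is None:
                return
            yield batch


# Usage: iterate over BackgroundPrefetcher(dl) instead of dl in the loop above.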

How long does load_dataset take in huggingface?

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code:
import datasets
from t5_tokenizer_model import SentencePieceUnigramTokenizer

vocab_size = 32_000
input_sentence_size = None

# Initialize a dataset
dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train")

tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>")


# Build an iterator over this dataset
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_length = 100
    for i in range(0, input_sentence_size, batch_length):
        yield dataset[i: i + batch_length]["text"]


# Train tokenizer
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)

# Save files to disk
tokenizer.save("./persian-t5-base/tokenizer.json")
For the downloading part the message is:
Downloading and preparing dataset oscar/unshuffled_deduplicated_fa (download: 9.74 GiB, generated: 37.24 GiB, post-processed: Unknown size, total: 46.98 GiB) to /root/.cache/huggingface/datasets/oscar/unshuffled_deduplicated_fa/1.0.0/...
I am running it on Google Colab Pro (with the High-RAM setting and on TPU). However, it has been about 2 hours and execution is still stuck on the load_dataset line.
What is it doing? Is it normal for load_dataset to take so much time? Should I interrupt it and run it again?
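One way to sanity-check whether it is still making progress (a sketch, assuming the default cache location shown in the message above): watch the size of the datasets cache directory grow from a separate cell.

import os

def dir_size_gb(path):
    # Sum up the size of all files under the cache directory.
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1e9

print(dir_size_gb(os.path.expanduser('~/.cache/huggingface/datasets')))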

How to have a large matrix of tf.zero shape

print("sequences",len(sequences))
print("seq_length",(seq_length))
print("vocab size",(vocab_size))
X = tf.zeros((len(sequences), seq_length, vocab_size), dtype=tf.bool)
y = tf.zeros((len(sequences), vocab_size), dtype=tf.bool)
Output
sequences 30373553
seq_length 30
vocab size 1290174
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-35-1bd9b1544ba0> in <module>()
2 print("seq_length",(seq_length))
3 print("vocab size",(vocab_size))
----> 4 X = tf.zeros((len(sequences), seq_length, vocab_size), dtype=tf.bool)
5 y = tf.zeros((len(sequences), vocab_size), dtype=tf.bool)
6
3 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
ResourceExhaustedError: OOM when allocating tensor with shape[30373553,30,1290174] and type bool on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:Fill] name: zeros/
I am working on TensorFlow 2.0 and want to make a matrix of zeros of shape [30373553, 30, 1290174].
When running the same code on TensorFlow 1.5 there was no such error, but it gives this error on TensorFlow 2.0.
Assuming each bool element uses 1 byte of memory, your tensor of shape [30373553, 30, 1290174] will take about 1200 TB of memory to materialize. That's a lot of memory...
I'm guessing that this didn't error out in TensorFlow 1.5 because of the old deferred-execution paradigm, where you can call tf.zeros([30373553, 30, 1290174]) without any issue because the symbolic tensor returned by the call won't be actually allocated in memory until you call tf.Session.run() on a tf.Graph that contains the tensor. In TensorFlow 2.0, however, eager execution will perform the memory allocation as soon as the call is made.
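A quick back-of-the-envelope check of that estimate (assuming 1 byte per bool element):

# Size of a bool tensor of shape [30373553, 30, 1290174] at 1 byte per element.
n_bytes = 30373553 * 30 * 1290174
print(n_bytes / 1e12, 'TB')  # ~1176 TB, i.e. roughly 1200 TB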

Why does creating a single tensor on the GPU take 2.5 seconds in PyTorch?

I'm just going through the beginner tutorial on PyTorch and noticed that one of the many different ways to put a tensor (basically the same as a numpy array) on the GPU takes a suspiciously long amount of time compared to the other methods:
import time
import torch

if torch.cuda.is_available():
    print('time =', time.time())
    x = torch.randn(4, 4)
    device = torch.device("cuda")
    print('time =', time.time())
    y = torch.ones_like(x, device=device)  # directly create a tensor on GPU => 2.5 secs??
    print('time =', time.time())
    x = x.to(device)  # or just use strings ``.to("cuda")``
    z = x + y
    print(z)
    print(z.to("cpu", torch.double))  # ``.to`` can also change dtype together!
    a = torch.ones(5)
    print(a.cuda())
    print('time =', time.time())
else:
    print('I recommend you get CUDA to work, my good friend!')
Output (just times):
time = 1551809363.28284
time = 1551809363.282943
time = 1551809365.7204516 # (!)
time = 1551809365.7236063
Version details:
1 CUDA device: GeForce GTX 1050, driver version 415.27
CUDA = 9.0.176
PyTorch = 1.0.0
cuDNN = 7401
Python = 3.5.2
GCC = 5.4.0
OS = Linux Mint 18.3
Linux kernel = 4.15.0-45-generic
As you can see, this one operation ("y = ...") takes much longer (2.5 seconds) than the rest combined (0.003 seconds). I'm confused about this, as I expect all these methods to do basically the same thing. I've tried making sure the types in this line are 32-bit and tried different shapes, but that didn't change anything.
When I reorder the commands, whichever command comes first takes 2.5 seconds. This leads me to believe there is a delayed one-time setup of the device happening here, and that future on-GPU allocations will be faster.
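That interpretation can be checked with an explicit warm-up (a sketch, not from the original post): pay the one-time CUDA context initialization up front, and synchronize around the timed call since GPU work is launched asynchronously.

import time
import torch

device = torch.device("cuda")
torch.cuda.init()             # forces the one-time CUDA context setup here
torch.cuda.synchronize()

t = time.time()
y = torch.ones(4, 4, device=device)  # now takes microseconds rather than seconds
torch.cuda.synchronize()             # wait for the asynchronous kernel to finish
print('time =', time.time() - t)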
