Develop an asynchronous generator in keras - multithreading

I am currently developing a CNN model in Keras using model.fit_generator, and I have a generator built on the keras.utils.Sequence class. My problem is that GPU utilization is not as high as it should be, meaning the model is CPU bottle-necked. I have played around with what the generator does to the data to make it more efficient, but it is still bottle-necked. My ideal situation is to have the generator continuously process data and keep it in memory (even single-threaded), so it can be fed to the GPU as soon as it is needed. Essentially, I am wondering whether there is a way to have the generator process the data asynchronously. Currently, the generator processes a batch, loads the batch onto the GPU, and waits for the GPU to finish. I have tweaked max_queue_size, workers, and use_multiprocessing, but nothing gets the GPU working to its full potential.
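For reference, a minimal sketch of the kind of setup described above; the generator class, toy data, model, and parameter values are illustrative, not taken from the original post. The idea is that worker processes keep up to max_queue_size preprocessed batches ready while the GPU consumes them:

    import numpy as np
    import keras

    class MyBatchGenerator(keras.utils.Sequence):
        """Does the per-batch CPU preprocessing; Keras runs it in workers."""
        def __init__(self, x, y, batch_size=32):
            self.x, self.y, self.batch_size = x, y, batch_size

        def __len__(self):
            return int(np.ceil(len(self.x) / self.batch_size))

        def __getitem__(self, idx):
            sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
            # Expensive CPU-side preprocessing would go here.
            return self.x[sl], self.y[sl]

    # Toy data and model, only to make the example runnable.
    x_train = np.random.rand(1024, 64).astype("float32")
    y_train = np.random.rand(1024, 1).astype("float32")
    model = keras.models.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(64,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Worker processes prepare up to max_queue_size batches ahead of the GPU,
    # so preprocessing overlaps with training instead of alternating with it.
    model.fit_generator(MyBatchGenerator(x_train, y_train),
                        epochs=2,
                        workers=4,
                        use_multiprocessing=True,
                        max_queue_size=16)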

GPU time may go to either data transfer or computation. Try to find out which takes longer; then you can better understand the effect of batch size.

Related

Threading in BRMS - how many threads/cores should I use?

I currently have ~500 nested mixed-effects models to run, and one model takes ~2-3 hours using between-chain parallelization. I have access to a 64-core machine, so I am looking to decrease the run time of my models using within-chain parallelization in brms. However, I am unsure how to set the number of threads and cores for each model. I plan to use 4 chains and want to maximize efficiency by reducing the run time as much as possible. Any guidance would be extremely helpful.

Training using PyTorch Distributed Data Parallel with world_size 4 on a multi-GPU machine continues even when only two GPU processes are started

I want to run a test to see how the synchronization works. I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass where gradients are synchronized. I used a 4-GPU machine and set the environment variable CUDA_VISIBLE_DEVICES so that only 2 GPU processes could be started. With only 2 GPU processes started, I assumed that at the end of the first batch the synchronization on the existing 2 GPUs would wait for the other two and time out, since the other two never started. What I observed is that training continued with only 2 GPU processes, even though the world size is 4. How can this be explained? Is my understanding incorrect?
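For concreteness, a minimal sketch of the kind of test setup described above; the model, data shapes, and launch details are placeholders, not taken from the original post. With the default env:// rendezvous, dist.init_process_group blocks until world_size ranks have joined (or the timeout expires), so that is where a hang would normally be expected:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def run(rank, world_size=4):
        # Requires MASTER_ADDR/MASTER_PORT in the environment; blocks until
        # `world_size` ranks have joined the process group (or it times out).
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
        opt = torch.optim.SGD(model.parameters(), lr=0.1)

        for _ in range(10):
            x = torch.randn(32, 10, device=f"cuda:{rank}")
            loss = model(x).sum()
            loss.backward()      # gradient all-reduce across ranks happens here
            opt.step()
            opt.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        # e.g. CUDA_VISIBLE_DEVICES=0,1 and only RANK=0 and RANK=1 launched,
        # while world_size is still 4.
        run(rank=int(os.environ["RANK"]))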

Can multiple tensorflow inferences run on one GPU in parallel?

I am trying to run TensorFlow as a server on one NVIDIA Tesla V100 GPU. As a server, my program needs to accept multiple requests concurrently. So, my questions are the following:
1. When multiple requests arrive at the same time, and supposing we are not using batching, are these requests run on the GPU sequentially or in parallel? I understand that independent processes have separate CUDA contexts, which are run sequentially on the GPU. But these requests are actually different threads in the same process and should share one CUDA context. So, according to the documentation, the GPU can run multiple kernels concurrently. If this is true, does it mean that if a large number of requests arrive at the same time, GPU utilization can go up to 100%? This never happens in my experiments.
2. What is the difference between running one session in different threads vs. running different sessions in different threads? Which is the proper way to implement a TensorFlow server? Which one does TensorFlow Serving use?
Any advice will be appreciated. Thank you!
Regarding #1: all requests will be run on the same GPU sequentially, since TF uses a single global compute stream for each physical GPU device (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L284).
Regarding #2: in terms of multi-streaming, the two options are similar; by default multi-streaming is not enabled. If you want to experiment with multiple streams, you may try the virtual_device option (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L138).
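To illustrate the first option in the question (one session shared by many request threads), here is a rough TF 1.x sketch; the graph, shapes, and thread count are placeholders, not from the original post. Session.run is thread-safe, but with the default single compute stream the kernels are still serialized on the GPU:

    import threading
    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API

    graph = tf.Graph()
    with graph.as_default():
        x = tf.placeholder(tf.float32, [None, 224, 224, 3], name="input")
        logits = tf.layers.dense(tf.layers.flatten(x), 1000, name="logits")
        init = tf.global_variables_initializer()

    sess = tf.Session(graph=graph)
    sess.run(init)

    def handle_request(batch):
        # Many request threads can share one session (and one CUDA context);
        # each call issues its kernels onto the same per-GPU compute stream.
        return sess.run(logits, feed_dict={x: batch})

    threads = [threading.Thread(target=handle_request,
                                args=(np.random.rand(1, 224, 224, 3),))
               for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()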
Thanks.
For model inference, you may want to look at high-performance inference engines like NVIDIA Triton. It allows multiple model instances, each with its own dedicated CUDA streams, so the GPU can exploit more parallelism.
See https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#concurrent-model-execution

Why Tensorflow creates so many CPU threads

Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, the TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable this unexpected thread spawning?
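For reference, a short sketch of how those two options are normally passed in TF 1.x (session-level configuration only; the extra threads discussed below are created outside these two pools):

    import tensorflow as tf  # TF 1.x

    config = tf.ConfigProto(inter_op_parallelism_threads=1,
                            intra_op_parallelism_threads=1)
    with tf.Session(config=config) as sess:
        # Limits only TF's own op-scheduling and intra-op thread pools.
        print(sess.run(tf.constant(2.0) * 3.0))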
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by the Python runtime
Two more threads are created by the NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers its internal compute jobs:
Threads are created and joined all the time to poll for job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latencies for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear algebra, perform cache-intensive operations best with a static, symmetric core-to-thread mapping. The dangling callback threads produced by TensorFlow will disturb this symmetry all the time.

Tensorflow while_loop for training #2

Here I asked how to solve an overhead problem by using while_loop for training (which allows evaluating train_op several times with only one run call). After that, I created 4 threads and ran one while_loop per thread to optimize in parallel. Is there a native mechanism in TensorFlow for such parallel optimization?
I use the Ftrl optimizer.
Thanks!
EDITED:
In my situation I have a big data set, which I read gradually in the main thread and enqueue into a FIFOQueue. I use batch optimization, and one optimization step on a small batch (ideally a single element) takes very little time (I use a linear model), so I want to do all the optimization steps in one run call, without returning to the Python interpreter on each step (because of the overhead problem). Right now I call run as many times as there are threads.
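For illustration, here is a rough TF 1.x sketch of the multi-threaded variant described above, without the while_loop: a FIFOQueue fed from the main thread and several Python threads each calling run on the same Ftrl train op, so the updates are applied asynchronously. The shapes, batch size, and toy linear model are assumptions, not taken from the original post:

    import threading
    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API

    x_in = tf.placeholder(tf.float32, [None, 10])
    y_in = tf.placeholder(tf.float32, [None, 1])
    queue = tf.FIFOQueue(capacity=10000, dtypes=[tf.float32, tf.float32],
                         shapes=[[10], [1]])
    enqueue_op = queue.enqueue_many([x_in, y_in])
    batch_x, batch_y = queue.dequeue_many(32)

    # Toy linear model trained with the Ftrl optimizer.
    w = tf.Variable(tf.zeros([10, 1]))
    b = tf.Variable(tf.zeros([1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(batch_x, w) + b - batch_y))
    train_op = tf.train.FtrlOptimizer(learning_rate=0.1).minimize(loss)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()

    def worker():
        # Each thread repeatedly dequeues a batch and applies an update;
        # updates from different threads are applied asynchronously.
        try:
            while not coord.should_stop():
                sess.run(train_op)
        except (tf.errors.OutOfRangeError, tf.errors.CancelledError):
            pass  # queue was closed and drained

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()

    # Main thread reads the data gradually and feeds the queue.
    for _ in range(100):
        chunk_x = np.random.rand(256, 10).astype(np.float32)
        chunk_y = np.random.rand(256, 1).astype(np.float32)
        sess.run(enqueue_op, feed_dict={x_in: chunk_x, y_in: chunk_y})

    coord.request_stop()
    sess.run(queue.close(cancel_pending_enqueues=True))
    coord.join(threads)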
