I am trying to run TensorFlow as a server on one NVIDIA Tesla V100 GPU. As a server, my program needs to accept multiple requests concurrently. So, my questions are the following:
When multiple requests arrive at the same time and we are not using batching, are these requests run on the GPU sequentially or in parallel? I understand that independent processes have separate CUDA contexts, which are run sequentially on the GPU. But these requests are actually different threads in the same process and should share one CUDA context. According to the documentation, the GPU can run multiple kernels concurrently. If this is true, does it mean that if a large number of requests arrive at the same time, GPU utilization can go up to 100%? This never happens in my experiments.
What is the difference between running one session in different threads vs. running different sessions in different threads? Which is the proper way to implement a TensorFlow server? Which one does TensorFlow Serving use?
Any advice will be appreciated. Thank you!
Regarding #1: all requests will be run on the same GPU sequentially, since TF uses a single global compute stream for each physical GPU device (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L284).
Regarding #2: in terms of multi-streaming, the two options are similar: by default multi-streaming is not enabled. If you want to experiment with multi-streams, you may try the virtual_device option (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L138)
Thanks.
For model inference, you may want to look at high-performance inference engines like NVIDIA Triton. It allows multiple model instances, each of which has dedicated CUDA streams, so the GPU can exploit more parallelism.
See https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#concurrent-model-execution
Related
This question is related to my other question.
I tried running multiple machine learning processes in parallel (with bash). These are written using PyTorch. After a certain number of concurrent programs (10 in my case), I get the following error:
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
As mentioned in this answer,
...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).
For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.
I tried the solution mentioned here, to enforce a per-process GPU memory usage limit, but this issue persists.
This problem does not occur with a single process, or with fewer processes. Since only one context runs at a single time instant, why does this cause a memory issue?
This issue occurs with/without MPS. I thought it could occur with MPS, but not otherwise, as MPS may run multiple processes in parallel.
Since only one context runs at a single time instant, why does this cause a memory issue?
Context-switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context switching doesn't alleviate that.
If you run multiple processes, the memory used by each process will add up (just like it does in the CPU space) and GPU context switching (or MPS or time-slicing) does not alleviate that in any way.
It's completely expected that if you run enough processes using the GPU, eventually you will run out of resources. Neither GPU context switching nor MPS nor time-slicing in any way affects the memory utilization per process.
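The additive behaviour described above can be sketched with simple arithmetic. The figures below (a 16 GiB card and about 1.7 GiB resident per process, including the CUDA context and framework overhead) are illustrative assumptions, not measurements from the question, and the helper names are hypothetical.

```python
def max_concurrent_processes(total_vram_mib: int, per_process_mib: int) -> int:
    """How many processes fit if each keeps its memory resident on the device."""
    if per_process_mib <= 0:
        raise ValueError("per_process_mib must be positive")
    return total_vram_mib // per_process_mib


def fits(total_vram_mib: int, per_process_mib: int, n_processes: int) -> bool:
    """Device memory is summed over processes; context switching frees nothing."""
    return n_processes * per_process_mib <= total_vram_mib


# Hypothetical numbers: 16 GiB card, ~1.7 GiB resident per training process.
TOTAL_MIB = 16 * 1024
PER_PROCESS_MIB = 1740

limit = max_concurrent_processes(TOTAL_MIB, PER_PROCESS_MIB)
print(f"Processes that fit: {limit}")
print(f"One more still fits: {fits(TOTAL_MIB, PER_PROCESS_MIB, limit + 1)}")
```

Under these assumed numbers the limit is reached regardless of how the GPU schedules the processes, because every process's allocations stay in device memory while other contexts run.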
I want to run a test to see how the synchronization works. I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass where gradients are synchronized. I used a 4-GPU machine and set the environment variable CUDA_VISIBLE_DEVICES to make sure only 2 GPU processes could be started. With only 2 GPU processes started, I assumed that at the end of the first batch, the synchronization on the existing 2 GPUs would wait on the other two and time out, as the other two never started. What I observed is that the training continued with only 2 GPU processes, even though the world size is 4. How can this be explained? Is my understanding incorrect?
Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, a TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable unexpected thread spawning?
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by the Python runtime
Two more threads are created by the NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers internal compute jobs:
Threads are created/joined all the time to poll on job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps, this design is intended to reduce latencies for async jobs. Yet, there is a certain drawback: multicore compute libraries, such as linear algebra, do cache-intensive operations best with static symmetric core-thread mapping. And dangling callback threads produced by TensorFlow will disturb this symmetry all the time.
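One way to observe such library-spawned threads is to compare the Python-level thread count with the OS-level one. The following is a minimal sketch, assuming a Linux system where /proc/self/status is available; the helper name `os_thread_count` is hypothetical, and the script would need to run after importing TensorFlow and creating a session to see the threads discussed here.

```python
import threading
from typing import Optional


def os_thread_count() -> Optional[int]:
    """Return the number of OS-level threads in this process (Linux only).

    Reads the "Threads:" field of /proc/self/status and returns None
    where that file is unavailable (e.g. macOS, Windows).
    """
    try:
        with open("/proc/self/status") as status:
            for line in status:
                if line.startswith("Threads:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


python_threads = threading.active_count()  # threads started via `threading`
native_threads = os_thread_count()         # all threads, including library-spawned ones
print(f"Python-level threads: {python_threads}, OS-level threads: {native_threads}")
```

A gap between the two numbers indicates threads created below the Python layer, which the inter_op/intra_op settings do not control.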
Given a cluster of several nodes, each of which hosts a multi-core processor, is there any advantage to using MPI between nodes and OpenMP/pthreads within nodes over using pure MPI everywhere? If I understand correctly, if I run an MPI program on a single node and set the number of processes equal to the number of cores, then I will have an honest parallel MPI job of several processes running on separate cores. So why bother with hybrid parallelization using threads within nodes and MPI only between nodes? I have no question in the case of an MPI+CUDA hybrid, as MPI cannot employ GPUs, but it can employ CPU cores, so why use threads?
Using a combination of OpenMP/pthread threads and MPI processes is known as Hybrid Programming. It is tougher to program than pure MPI but with the recent reduction in latencies with OpenMP, it makes a lot of sense to use Hybrid MPI. Some advantages are:
Avoiding data replication: Since threads can share data within a node, if any data needs to be replicated between processes, we can avoid this.
Light-weight : Threads are lightweight and thus you reduce the meta-data associated with processes.
Reduction in number of messages : A single process within a node can communicate with other processes, reducing number of messages between nodes (and thus reducing pressure on the Network Interface Card). The number of messages involved in collective communication is notable.
Faster communication : As pointed out by #user3528438 above, since threads communicate using shared memory, you can avoid using point-to-point MPI communication within a node. A recent approach (2012) recommends using RMA shared memory instead of threads within a node - this model is called MPI+MPI (search google scholar using MPI plus MPI).
Hybrid MPI has its disadvantages as well, but you asked only about the advantages.
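The reduction in inter-node messages can be illustrated with a rough count. This sketch assumes a naive all-to-all exchange in which every rank sends one message to every rank on another node; real MPI libraries use optimized collective algorithms, so the numbers are only an illustration. The cluster size (8 nodes, 16 cores each) and the function name are hypothetical.

```python
def inter_node_messages(nodes: int, ranks_per_node: int) -> int:
    """Messages crossing the network in a naive all-to-all exchange."""
    total_ranks = nodes * ranks_per_node
    remote_ranks = (nodes - 1) * ranks_per_node  # ranks living on other nodes
    return total_ranks * remote_ranks


NODES = 8            # hypothetical cluster size
CORES_PER_NODE = 16  # hypothetical cores per node

# Pure MPI: one rank per core. Hybrid: one rank per node, threads inside it.
pure_mpi = inter_node_messages(NODES, CORES_PER_NODE)
hybrid = inter_node_messages(NODES, 1)

print(f"pure MPI: {pure_mpi} messages, hybrid: {hybrid} messages")
```

In this simplified model the message count shrinks by the square of the number of cores per node, which is why the pressure on the network interface drops so sharply.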
This is in fact a much more complex question than it looks.
It depends on a lot of factors. From experience I would say: you are always happy to avoid hybrid OpenMP-MPI, which is a mess to optimize. But there are moments when you cannot avoid it, mainly depending on the problem you are solving and the cluster you have access to.
Let's say you are solving a highly parallelizable problem and you have a small cluster; then hybrid will probably be useless.
But suppose you have a problem that scales well up to N processes but starts to show very poor efficiency at 4N, and you have access to a cluster with 10N cores. Then hybridization will be a solution. You will use a small number of threads per MPI process, something like 4 (it is known that >8 is not efficient).
(It's fun to think that on KNL most people I know use 4 to 8 threads per MPI process, even though one chip has 68 cores.)
Then what about hybrid accelerator/openMP/MPI.
You are wrong about accelerator + MPI. As soon as you start to use a cluster that has accelerators, you will need something like OpenMP/MPI, CUDA/MPI, or OpenACC/MPI, as you will need to communicate between devices. Nowadays you can bypass the CPU using GPUDirect (at least for NVIDIA; I have no clue about other vendors, but I expect it would be the case). Then usually you will use 1 MPI process per GPU. Most clusters with GPUs will have 1 socket and N accelerators (N
I am using Node.js for a CPU-intensive task, which basically generates a large amount of data and stores it in a file. I am streaming the data to output files as it is generated for a single type of data.
Aim: I want to generate this data for multiple types of data in parallel (making the best use of my multi-core CPU), without each process having its own heap memory, thus providing a larger process memory and increased speed of execution.
I was planning to use node fibers, which Meteor also uses for its own callback handling. But I am not sure this will achieve what I want: in one of the videos on Meteor fibers, Chris Mather mentions at the end that eventually everything is single-threaded, and node fibers somehow manages the same single-threaded event loop to provide its functionality.
So,
Does this mean that if I use node fibers I won't be running my task in parallel, and thus won't be utilizing my CPU cores?
Will node webworker-threads help me achieve the functionality I want? The module's home page says that webworker threads run on separate/parallel CPU processes, thus providing real multi-threading.
As a final question: does this mean that Node.js is not advisable for such CPU-intensive tasks?
Note: I don't want to use asynchronous code-structuring libraries that are presented as threads but in fact just add syntactic sugar over the same async code, as the tasks are largely CPU-intensive. I have already used async capabilities to the max.
// Update 1 (based on answer for clusters )
Sorry, I forgot to mention this, but the problems I faced with clusters are:
Complex to load balance the amount of work I have in a way which makes sure a particular set of parallel tasks execute before certain other tasks.
Not sure if clusters really do what I want, referring to these lines on the webworker-threads npm homepage:
The "can't block the event loop" problem is inherent to Node's evented model. No matter how many Node processes you have running as a Node-cluster, it won't solve its issues with CPU-bound tasks.
Any light on how would be helpful.
Rather than trying to implement multiple threads, you should find it much easier to use multiple processes with Node.js.
See, for example, the cluster module. This allows you to easily run the same js code in multiple processes, e.g. one per core, and collect their results / be notified once they're completed.
If cluster does more than you need, then you can also just call fork directly.
If you must have thread-parallelism rather than process-, then you may want to look at writing an async native module. Then you have access to the libuv thread pool (though starving it may reduce I/O performance) or can fork your own threads as you wish (but then you're on your own for synchronising with the rest of Node).
After update 1
For load balancing, if what cluster does isn't working for you, then you can just do it yourself with fork, as I mentioned. The source for cluster is available.
For the other point, it means if the task is truly CPU-bound then there's no advantage Node will give you over other technologies, other than being simpler if everything else is using Node. The only option you have is to make sure you're using all the available CPU resources, which a worker pool will give you. If you're already using Node then the easiest options are using the ones it's already got (cluster or libuv). If they're not sufficient then yeah, you'll have to find something else.
Regardless of technology, it remains true that multi-process parallelism is a lot easier than multi-thread parallelism.
Note: despite what you say, you definitely do want to use async code precisely because it is CPU-intensive; otherwise your tasks will block all I/O. You do not want this to happen.