TensorFlow Communication from CPU to Multiple GPUs - multithreading

From the TensorFlow white paper and this answer from @mrry, TensorFlow adds Send/Recv ops to copy data across device boundaries, and uses Rendezvous to do the actual work. The communication method across different devices in TensorFlow is non-blocking send, blocking recv.
It seems (please correct me if I am wrong) that after reading, parsing and batching from the input data queue, the batched examples are sent from the CPU to the GPU when I train on a GPU.
What I want to know is: when using multiple GPUs, how are the batched examples on the CPU sent to them? Are there several Send ops on the CPU, each matched to one GPU, so that each GPU gets one batch_size worth of examples? Or is there only one Send op on the CPU?
From the config.proto documentation, the inter_op_parallelism_threads option configures a thread pool used for parallel execution, as the comments describe:
// Nodes that perform blocking operations are enqueued on a pool of
// inter_op_parallelism_threads available in each process.
Does this mean that blocking operations such as Recv can be executed on multiple threads when the inter_op_parallelism_threads option is set?
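For reference, this is roughly how that option is set from Python with TF 1.x; the thread counts below are arbitrary example values, not a recommendation:

import tensorflow as tf  # TF 1.x

# The inter_op pool runs nodes that are ready to execute (including work
# scheduled when a Recv's done callback fires); intra_op controls the
# parallelism inside individual kernels. Values here are arbitrary.
config = tf.ConfigProto(inter_op_parallelism_threads=4,
                        intra_op_parallelism_threads=4)
sess = tf.Session(config=config)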
If the description of the questions is unclear, please ask me for more details. Thanks.

According to my understanding of TensorFlow after looking into the source code a bit deeper:
There are multiple Send/Recv ops, since a Send/Recv op pair is created for each cross-device edge.
Recv is an async op which will never block execution. Its done callback is called when the data for the Recv op is ready.
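To make the first point concrete, here is a minimal in-graph multi-GPU sketch (TF 1.x, two GPUs assumed; shapes and names are made up). Each use of a CPU-resident tensor inside a tf.device('/gpu:i') block is a separate cross-device edge, so the graph partitioner inserts one Send on the CPU and one Recv on that GPU per edge:

import tensorflow as tf  # TF 1.x, graph mode

NUM_GPUS = 2  # assumption: two visible GPUs

with tf.device('/cpu:0'):
    # Stand-in for the input pipeline: the full batch lives on the CPU.
    batch = tf.random_normal([NUM_GPUS * 32, 128])
    shards = tf.split(batch, NUM_GPUS, axis=0)

tower_outputs = []
for i in range(NUM_GPUS):
    with tf.device('/gpu:%d' % i):
        # shards[i] crosses the CPU->GPU boundary here, producing one
        # Send/Recv pair for this edge at graph-partition time.
        w = tf.get_variable('w%d' % i, shape=[128, 10])
        tower_outputs.append(tf.matmul(shards[i], w))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tower_outputs)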

Related

Does a large Max Degree Of Parallelism cause queuing?

I would like to know if my understanding is correct that setting a Max Degree Of Parallelism (MDOP) value larger than a machine's available processor count causes the queueing effect that I have described below.
Please see this as a purely I/O asynchronous operation:
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
If there is a requirement for 100 HTTP endpoints to be called and the MDOP is set to 100, this will create 100 HTTP request tasks at the same time, all running in parallel. The problem is that only 16 will ever be handled at once, meaning the rest are effectively queued and will be handled once a processor frees up, resulting in an increased response time. In addition, the process will be slowed down further by other parts of the system demanding use of the 16 available processors.
Setting the MDOP to half the available processor count (8, for example, on a 16-processor machine) means that 8 HTTP request tasks will be in flight at any one time. The response times of the 8 requests will be minimal because there is no queueing of the tasks, as the chosen MDOP is well under the machine's available processor resources. Further to this, there are also another 8 processors available to handle any other tasks required by the machine.
The main difference is that the overall response time for 100 calls will be lower with an MDOP of 100, as all 100 tasks were started at the same time, whereas with 8 there are only ever 8 requests in flight at once.
The implicit assumptions made in the question are not correct.
IO operations are generally far from saturating a core. Synchronous and asynchronous requests result in different behaviours. The former is not efficient and should not be used. Neither should be limited to the number of available cores, but rather to the maximum concurrency of the target device completing the IO operations, assuming the software stack is doing its job correctly.
For synchronous requests, most of the time is spent waiting for the operation to complete. For example, for a network operation, the OS sends the request buffer to the NIC, which sends it asynchronously over the network link. It takes some time to be sure the data has been sent, so the NIC needs to wait a bit before marking the send request as completed. It also sometimes needs to wait for the link to be ready. During this time the processor can be free, and it can actually queue new requests to the NIC. Not to mention that the response to the sent request will take a significant time (during which neither the processor nor the link works for this specific request). When a synchronous operation needs to wait for the target device, the IO scheduler of the OS does a context switch (assuming the user code does a proper passive wait). This enables the processor to start new IO requests of other threads or to overlap the IO requests with computation when the load is high.
If there are not enough threads to perform IO operations, then that is the main issue, not the number of cores itself. Increasing the number of threads is not efficient: it just increases the number of context switches and thread migrations, resulting in significant overheads. Asynchronous operations should be used instead. Regarding the OS stack, it may also cause many context switches, but these are generally scheduled more efficiently by the OS. Moreover, using asynchronous IO operations removes the artificial limitation on the number of threads (i.e. the maximum degree of parallelism).
For asynchronous operations, one thread can start a lot of IO requests before they are actually completed. Having more cores does not directly mean that more requests can be completed in a given fixed amount of time. This is only true if the OS IO stack is truly parallel and if the operations are limited by the OS stack rather than by the concurrency of the target device (this tends to be true nowadays, for example on SSDs, which are massively parallel). The thing is that modern processors are very fast, so a few threads should theoretically be enough to saturate the queue of most target devices, although in practice not all OS stacks are efficiently designed for modern IO devices.
Every software and hardware stack has a maximum degree of parallelism meant to saturate the device and so mitigate the latency of IO requests. Because IO latency is generally high, IO request queues are large. "Queuing" does not mean much here, since requests are eventually queued anyway. The question is whether they are queued in the OS stack rather than in the device's queue, that is, whether the degree of parallelism of the software stack (including the OS) is bigger than that of the target device (which may or may not truly process the incoming requests in its queue in parallel). The answer is generally yes if the target application sends a lot of requests and the OS stack does not provide any mechanism to regulate the amount of incoming requests. That being said, some APIs provide or even guarantee such a mechanism (asynchronous IO ring buffers are a good example).
Put shortly, it depends on the exact target device, the target operating system, the OS API/stack used, as well as the application itself. The system can be seen as a big platform-dependent dataflow where queues are everywhere, so one needs to carefully specify what "MDOP" and "queuing" mean in this context.
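As an illustration of regulating in-flight asynchronous requests independently of the core count, here is a minimal Python asyncio sketch. fetch() is a placeholder standing in for a real asynchronous HTTP call, and MAX_IN_FLIGHT is an arbitrary number that should be tuned to the target service, not to the number of CPUs:

import asyncio

MAX_IN_FLIGHT = 32  # arbitrary; tune to the target device/service, not the CPU count

async def fetch(i):
    await asyncio.sleep(0.1)  # placeholder for the actual network wait
    return i

async def fetch_all(n):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)

    async def bounded(i):
        async with sem:  # at most MAX_IN_FLIGHT requests awaited at once
            return await fetch(i)

    return await asyncio.gather(*(bounded(i) for i in range(n)))

results = asyncio.run(fetch_all(100))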
You cannot expect anyone to know what you mean by MDOP unless you mention the precise technology in the context of which you are using this term. Microsoft SQL Server has a concept of MDOP, but you are talking about HTTP requests, so you are probably not talking about MSSQL. So, what are you talking about? Anyway, on with the question.
A computer has (for example) 16 processors. This means a max of 16 tasks can be worked on at any one time.
No, it doesn't mean that. It means that the computer can execute 16 CPU instructions simultaneously. (If we disregard pipelines, superscalar pipelines, memory contention, etc.) A "Task" is a very high-level concept which involves all sorts of things besides just executing CPU instructions. For example, it involves waiting for I/O to complete, or even waiting for events to occur, events which might be raised by other tasks.
When a system allows you to set the value of some concept such as a "degree of parallelism", this means that there is no silver bullet for that value, so depending on the application at hand, different values will yield different performance benefits. Knowing your specific usage scenario you can begin with an educated guess for a good value, but the only way to know the optimal value is to try and see how your actual system performs.
Specifically with degree of parallelism, it depends on what your threads are doing. If your threads are mostly computation-bound, then a degree of parallelism close to the number of physical cores will yield best results. If your threads are mostly I/O bound, then your degree of parallelism should be orders of magnitude larger than the number of physical cores, and the ideal value depends on how much memory each thread is consuming, because with too many threads you might start hitting memory bottlenecks.
Proof: check how many threads are currently alive on your computer. (Use some built-in system monitor utility, or download one if need be.) Notice that you have thousands of threads running. And yet look at your CPU utilization: it is probably close to zero. That's because virtually all of those thousands of threads are doing nothing most of the time but waiting for stuff to happen, like for you to press a key or for a network packet to arrive.
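If the third-party psutil package happens to be installed, that check is only a few lines (a rough sketch that counts every live thread on the machine and samples CPU utilization):

import psutil  # assumption: psutil is installed (pip install psutil)

# Count every live thread across all processes, then sample CPU utilization.
total_threads = 0
for p in psutil.process_iter(attrs=['num_threads']):
    total_threads += p.info['num_threads'] or 0

print('live threads:', total_threads)
print('cpu utilization: %.1f%%' % psutil.cpu_percent(interval=1.0))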

Can multiple tensorflow inferences run on one GPU in parallel?

I am trying to run TensorFlow as a server on one NVIDIA Tesla V100 GPU. As a server, my program needs to accept multiple requests concurrently. So, my questions are the following:
When multiple requests arrive at the same time, and supposing we are not using batching, are these requests run on the GPU sequentially or in parallel? I understand that independent processes have separate CUDA contexts, which run sequentially on the GPU. But these requests are actually different threads in the same process and should share one CUDA context. So according to the documentation, the GPU can run multiple kernels concurrently. If this is true, does it mean that if a large number of requests arrive at the same time, GPU utilization can go up to 100%? This never happens in my experiments.
What is the difference between running one session in different threads vs. running different sessions in different threads? Which is the proper way to implement a TensorFlow server? Which one does TensorFlow Serving use?
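For concreteness, the "one session in different threads" variant from #2 looks roughly like this in TF 1.x graph mode (shapes and names are made up; Session.run() is thread-safe, so the graph and the session are built once and shared):

import threading

import numpy as np
import tensorflow as tf  # TF 1.x, graph mode

x = tf.placeholder(tf.float32, [None, 128])
w = tf.get_variable('w', shape=[128, 10])
y = tf.matmul(x, w)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

def handle_request(batch):
    # Every request thread calls run() on the same shared session.
    return sess.run(y, feed_dict={x: batch})

threads = [threading.Thread(target=handle_request,
                            args=(np.zeros((32, 128), np.float32),))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()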
Any advice will be appreciated. Thank you!
Regarding #1: all requests will run on the same GPU sequentially, since TF uses a single global compute stream for each physical GPU device (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L284)
Regarding #2: in terms of multi-streaming, the two options are similar: by default multi-streaming is not enabled. If you want to experiment with multi-streams, you may try the virtual_device option (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L138)
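A rough sketch of what that experiment could look like from Python (TF 1.x ConfigProto; the memory split below is an arbitrary example, and the exact behaviour of virtual devices may differ between TF versions):

import tensorflow as tf  # TF 1.x

config = tf.ConfigProto()
# Split the single physical GPU into two logical devices, each with its own
# compute stream; the memory limits are an arbitrary example.
vdev = config.gpu_options.experimental.virtual_devices.add()
vdev.memory_limit_mb.extend([8192.0, 8192.0])

sess = tf.Session(config=config)
# Ops can now be placed on '/gpu:0' and '/gpu:1', which map to the same
# physical GPU but no longer share a single compute stream.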
Thanks.
For model inference, you may want to look at high-performance inference engines like NVIDIA Triton. It allows multiple model instances, each of which has a dedicated CUDA stream, so the GPU can exploit more parallelism.
See https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#concurrent-model-execution

Why Tensorflow creates so many CPU threads

Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, the TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable unexpected thread spawning?
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by Python runtime
Two more threads are created by NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers internal compute jobs:
Threads are created/joined all the time to poll on job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latencies for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear algebra libraries, perform cache-intensive operations best with a static, symmetric core-to-thread mapping, and the dangling callback threads produced by TensorFlow will disturb this symmetry all the time.
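On Linux this is easy to verify; a minimal sketch (TF 1.x) that sets both options to 1, runs a trivial op and then counts the native threads of the process via /proc:

import os
import tensorflow as tf  # TF 1.x

config = tf.ConfigProto(inter_op_parallelism_threads=1,
                        intra_op_parallelism_threads=1)
sess = tf.Session(config=config)
sess.run(tf.constant(1) + tf.constant(1))

# Each native thread of this process appears as an entry under /proc/<pid>/task;
# the count stays well above 1 even with both options set to 1.
print('native threads:', len(os.listdir('/proc/%d/task' % os.getpid())))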

Sockets - select / thread / both

Recently I have been learning about network programming. I know that for a server to handle multiple clients, there is a need to use select or threads (at least in Python/C/C++; I do not know of anything similar to select in Java, where I only know the thread approach).
I have read that using select is better from the performance point of view and that threads are better for small servers. However, yesterday I found this page: http://www.assembleforce.com/2012-08/how-to-write-a-multi-threading-server-in-python.h and I do not understand why the author of the provided code uses both select and threads. It is difficult for me to understand how exactly it works and why it is better than the other methods I mentioned. I do not understand the idea behind this code.
Thank you.
Threads and select are not mutually exclusive.
Multi-threading is a form of parallel processing, allowing a single process to seemingly perform multiple tasks in an asynchronous manner.
Using select allows your program to monitor a file descriptor (e.g., a socket), waiting for an event.
Both can be (and, to my knowledge, frequently are) used together. In a network server environment, threading can be used to service multiple clients, while select is used so that a thread will not hog CPU time while idling.
Imagine that you are receiving data from multiple clients. A thread is waiting for data from client1, which is taking too long, meanwhile, client2 is sending data like crazy. You have three options:
Without select, using blocking calls: Block waiting for data from client1, and leave client2 waiting.
Without select, using non-blocking calls: Continuously poll client1, giving up after n tries without any data transfer.
With select: Monitor the clients' sockets. If they have data to transfer, read it. Otherwise, relinquish the current thread's CPU time.
This is a simple non-blocking approach to network servers, trying to give low-latency responses to clients. There are different approaches, and for those I recommend you check the book UNIX Network Programming.
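A minimal sketch of the third option in Python: one thread monitors many client sockets with select() and reads only from those that are ready (an echo server on an arbitrary port, kept short on purpose):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('0.0.0.0', 9000))  # arbitrary port
server.listen(16)

sockets = [server]
while True:
    readable, _, _ = select.select(sockets, [], [])  # block until something is ready
    for s in readable:
        if s is server:
            conn, _ = server.accept()   # new client connection
            sockets.append(conn)
        else:
            data = s.recv(4096)
            if not data:                # client hung up
                sockets.remove(s)
                s.close()
            else:
                s.sendall(data)         # echo the data back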

Boost: multithread performance, reuse of threads/sockets

I'll first describe my task and then present my questions below.
I am trying to implement the "one thread one connection" scheme for our distributed DAQ system. I have used Boost for threads (thread_group) and ASIO for sockets, on a Linux platform.
We have 320 networked DAQ modules. Approximately once every 0.25 ms, about half of them will each generate a packet of data (smaller than the standard MTU) and send it to a Linux server. Each of the modules has its own long-lived TCP connection to its dedicated port on the server. That is, the server-side application runs 320 threads with 320 synchronous TCP receivers, on a 1 GbE NIC and 8 CPU cores.
The 320 threads do not have to do any computing on the incoming data - they only receive the data, generate and add a timestamp, and store the data in thread-owned memory. The sockets are all synchronous, so threads that have no incoming data are blocked. Sockets are kept open for the duration of a run.
Our requirement is that the threads should read their individual socket connections with as little time lag as possible. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-size packets every second.
My problem is this: I first tested the system by firing time-synchronized data at the server (incoming data on different sockets arrive less than a few microseconds apart). When the number of data packets is very small (fewer than 10), I find that the threads' timestamps are separated by a few microseconds. However, with more than 10, the timestamps are spread by as much as 0.7 s.
My questions are:
Have I totally misunderstood the C10K issue and messed up the implementation? 320 does seem trivial compared to C10K.
Any hints as to what's going wrong?
Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
320 threads is chump change in terms of resources, but the scheduling may pose issues.
With roughly half of the 320 modules sending a packet every 0.25 ms, the server sees on the order of 600,000 packets per second, and because you decided that each connection must live on its own thread, a large share of those arrivals implies a context switch.
I'd simply suggest: don't do this. It's well known that thread-per-connection doesn't scale. And it almost always implies further locking contention on any shared resources (assuming that all the responses aren't completely stateless).
Q. Having read about the C10K problem and this post, I expected that each thread would easily process the equivalent of at least 1K MTU-size packets every second
Yes. A single thread can easily sustain that (on most systems). But that is no longer true, obviously, if you have hundreds of threads trying to do the same, competing for a physical core.
So for maximum throughput and low latency, it's hardly ever useful to have more threads than there are available (!) physical cores.
Q. Could this really be a case for reuse of threads and/or sockets? (I really don't know how to implement reuse in my case, so any explanation is appreciated.)
The good news is that Boost Asio makes it very easy to use a single thread (or a limited pool of threads) to service the asynchronous tasks from its service queue.
That is, assuming you already use the *_async versions of the ASIO API functions.
I think the vast majority - if not all - of the Boost Asio examples of asynchronous IO show how to run the service on a limited number of threads only.
http://www.boost.org/doc/libs/1_57_0/doc/html/boost_asio/examples.html
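Boost aside, the underlying single-threaded event-loop idea can be sketched in a few lines of Python with the stdlib selectors module; this is only an illustration of the principle (one listener and one port for brevity, unlike the 320 dedicated ports in the question), not Boost.Asio code:

import selectors
import socket
import time

sel = selectors.DefaultSelector()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('0.0.0.0', 5000))  # arbitrary single port for brevity
listener.listen(320)
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

while True:
    for key, _ in sel.select():          # one thread services every socket
        if key.fileobj is listener:
            conn, _ = listener.accept()
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            packet = key.fileobj.recv(2048)
            if packet:
                stamp = time.monotonic() # timestamp on arrival
                # store (stamp, packet) in per-connection memory here
            else:
                sel.unregister(key.fileobj)
                key.fileobj.close()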

Resources