I have a small problem where GPU or CPU multithreading does not give much of a speedup. I also have a machine with many CPU threads (no GPU), and I would like to use them to train or tune many model variants on this problem in parallel, each variant using only one thread. How can I do this?
Edit: Here are snippets of stuff I would like to do:
from kerastuner.tuners import RandomSearch
tuner = RandomSearch(...)
tuner.search(...) # a machine with 128 CPU threads should search 128 models in parallel, training each with only one thread
def fit_all(models, x, y):
...
fit_all(models, x, y) # a machine with 16 CPU threads should train 16 models in parallel, each using only one thread
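To make the intent concrete, here is a minimal sketch (not from the original question) of one way the fit_all idea could be implemented on a CPU-only machine: each model variant is trained in its own process, and each process pins itself to a single thread before importing TensorFlow. The toy Dense models and the unit_counts parameter are illustrative stand-ins for the real model variants.

import os
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _train_one(units, x, y):
    # Each worker process restricts itself to one thread *before* importing
    # TensorFlow, so 16 worker processes really do use 16 threads in total.
    os.environ["OMP_NUM_THREADS"] = "1"
    import tensorflow as tf
    tf.config.threading.set_intra_op_parallelism_threads(1)
    tf.config.threading.set_inter_op_parallelism_threads(1)
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(x, y, epochs=5, verbose=0)
    return units, float(model.evaluate(x, y, verbose=0))

def fit_all(unit_counts, x, y, n_workers=None):
    # One single-threaded training job per process, as many processes as cores.
    with ProcessPoolExecutor(max_workers=n_workers or os.cpu_count()) as pool:
        futures = [pool.submit(_train_one, u, x, y) for u in unit_counts]
        return [f.result() for f in futures]

if __name__ == "__main__":
    x = np.random.rand(1000, 8).astype("float32")
    y = np.random.rand(1000, 1).astype("float32")
    print(fit_all([8, 16, 32, 64], x, y))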
Related
I am calling a Flask server (running with 64 gunicorn workers) via HTTP from a test script that uses ThreadPoolExecutor, to measure execution time.
Both the test script and the Flask server are running on the same host. Inside the Flask endpoint there is a DB (Postgres) call.
Machine stats: 11 cores, 16 GB RAM.
Test script:
import logging
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

logging.basicConfig(level=logging.INFO)


class StressTest:
    def __init__(self, max_workers):
        self.__max_workers = max_workers
        self.logger = logging.getLogger(self.__class__.__name__)

    def __async_test(self, x=range(1000)):
        with ThreadPoolExecutor(max_workers=self.__max_workers) as executor:
            futures = [executor.submit(self.__test, n=n) for n in x]
            for future in futures:
                try:
                    future.result()
                except Exception:
                    raise  # propagate the original exception unchanged

    def __test(self, n):
        # REST call to the Flask endpoint (omitted)
        pass

    def run(self):
        start_time = datetime.now()
        self.__async_test()
        duration = datetime.now() - start_time
        self.logger.info(f"duration for the process is {duration.total_seconds()} seconds")


if __name__ == "__main__":
    StressTest(1).run()
    StressTest(2).run()
    StressTest(4).run()
    # ... and so on, up to StressTest(64).run()
Results are as follows:

No. of threads | Execution time (s)
---------------|-------------------
 1             | 50
 2             | 25
 4             | 12
 8             | 8.5
16             | 8
32             | 7.5
64             | 7.5
My question is: why does the execution time saturate after 8 threads? Isn't it possible to run multiple threads at the same time on a given core?
At any given time, only 11 threads are running (observed via htop) during the test.
Please let me know if you want additional information about my test. Thank you!
TL;DR: Regardless of the programming language used, a given core can execute only one thread at any given moment in time.
More details:
When we say "thread", we can mean a CPU-level execution entity (and then the answer above applies, period), or we can actually be talking about so-called "tasks", "async execution", "futures", and the like. All of these terms describe a higher-level abstraction built on top of threads: they specify the logical steps to get the work done, not the direct utilization of CPU capabilities. However, the actual execution of that abstraction is still performed by CPU thread(s). And that brings us back to the very first disclaimer: a given core can execute only one thread at any given moment.
The ThreadPoolExecutor documentation notes that it "utilizes at most 32 CPU cores for CPU bound tasks which release the GIL".
It's that latter condition which should worry you. "GIL" is the Global Interpreter Lock. Your Python runtime has only one such lock, which is why it is called "Global", and it is used for pretty much anything that touches memory. Even a simple a = b can require the GIL. Statements like these therefore do not get faster when more cores are available - there simply aren't more locks.
However, I/O operations are designed so that they do not need the GIL while the OS is working on the operation. In this case, multiple cores could help a bit - but here the OS typically is waiting on other hardware, not the CPU.
Postgres can serve multiple connections concurrently, so multiple client threads can have multiple queries executing at the same time. This would likely give a speed-up for that part of the processing. As noted in the comments, Amdahl's Law tells you that a program with mixed multi-threaded and single-threaded parts will be limited by its single-threaded parts once enough cores are available. And in Python, the GIL behaves as that single-threaded part.
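As a rough illustration (not part of the original answer), Amdahl's Law can be checked against the table above: the single-threaded run takes 50 s and the times level off around 7.5 s, which corresponds to a serial fraction of roughly 7.5 / 50 = 0.15.

# Illustrative only: predicted runtimes under Amdahl's Law, assuming a serial
# fraction of about 0.15 inferred from the measurements above.
def amdahl_runtime(t_single, serial_fraction, n_threads):
    parallel_fraction = 1.0 - serial_fraction
    return t_single * (serial_fraction + parallel_fraction / n_threads)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:>2} threads -> {amdahl_runtime(50.0, 0.15, n):5.1f} s")
# The predicted times flatten toward 50 * 0.15 = 7.5 s no matter how many
# threads are added, which matches the plateau in the measured numbers.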
I want to run a test to see how the synchronization works. I assume that at the end of each batch, DDP waits for the processes on all world_size GPUs to reach a synchronization point, such as the backward pass that synchronizes gradients. I used a 4-GPU machine and set the environment variable CUDA_VISIBLE_DEVICES so that only 2 GPU processes could be started. With only 2 GPU processes started, I assumed that at the end of the first batch the synchronization on the existing 2 GPUs would wait for the other two and time out, since the other two never started. What I observed is that training continued with only 2 GPU processes, even though the world size is 4. How can this be explained? Is my understanding incorrect?
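For reference, here is a minimal reconstruction of the setup described above (assumed, not the original author's script): world_size is fixed at 4 while CUDA_VISIBLE_DEVICES and nprocs limit the job to two GPU processes.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Rendezvous: all world_size ranks are expected to join here.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(8, 1).cuda(rank), device_ids=[rank])
    x = torch.randn(32, 8, device=rank)
    loss = model(x).sum()
    loss.backward()  # gradient all-reduce: the per-batch synchronization point
    dist.destroy_process_group()

if __name__ == "__main__":
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose only 2 of the 4 GPUs
    mp.spawn(worker, args=(4,), nprocs=2)       # start only 2 of the 4 ranks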
I am trying to run TensorFlow as a server on one NVIDIA Tesla V100 GPU. As a server, my program needs to accept multiple requests concurrently. So, my questions are the following:
When multiple requests arrive at the same time, and suppose we are not using batching, are these requests run on the GPU sequentially or in parallel? I understand that independent processes have separate CUDA contexts, which are run sequentially on the GPU. But these requests are actually different threads in the same process and should share one CUDA context. So, according to the documentation, the GPU can run multiple kernels concurrently. If this is true, does it mean that if a large number of requests arrive at the same time, GPU utilization can go up to 100%? But this never happens in my experiments.
What is the difference between running one session in different threads vs. running different sessions in different threads? Which is the proper way to implement a Tensorflow server? Which one does Tensorflow Serving use?
Any advice will be appreciated. Thank you!
Regarding #1: all requests will be run on the same GPU sequentially, since TF uses a single global compute stream for each physical GPU device (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/gpu/gpu_device.cc#L284)
Regarding #2: in terms of multi-streaming, the two options are similar: by default multi-streaming is not enabled. If you want to experiment with multi-streams, you may try the virtual_device option (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto#L138)
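As a hedged illustration of the virtual-device idea (using the tf.config.experimental API available in TF 1.14+/2.x rather than the raw config.proto field linked above), one physical GPU can be split into logical devices, which is the setup the answer suggests for experimenting with multiple streams:

import tensorflow as tf

# Split physical GPU 0 into two logical devices; memory limits are arbitrary.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [
            tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096),
            tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096),
        ],
    )
logical_gpus = tf.config.experimental.list_logical_devices("GPU")
print("Logical GPUs:", logical_gpus)  # expect /device:GPU:0 and /device:GPU:1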
Thanks.
For model inference, you may want to look at a high-performance inference engine like NVIDIA Triton. It allows multiple model instances, each of which has dedicated CUDA streams, so the GPU can exploit more parallelism.
See https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#concurrent-model-execution
I am currently developing a CNN model in Keras using model.fit_generator, and I have a generator built on the keras.utils.Sequence class. My problem is that, looking at GPU utilization, it is not as high as it should be, meaning the pipeline is CPU-bottlenecked. I have played around with what the generator does to the data to make it more efficient, but it is still bottlenecked. My ideal situation is to have the generator continuously process data and store it in memory (even single-threaded), so that batches can be fed to the GPU as soon as they are needed. Essentially, I am wondering whether there is a way to have the generator process the data asynchronously for a more efficient pipeline. Currently, the generator processes a batch, the batch is loaded onto the GPU, and then the generator waits for the GPU to finish. I have tweaked max_queue_size, workers, and use_multiprocessing, but nothing seems to get the GPU working to its full potential.
GPU time may be spent either on data transfer or on computation. Try to find out which takes longer; then you can better understand the effect of batch size.
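One direction worth trying (a sketch only, not from the original answer) is to move the preprocessing into a tf.data pipeline, which runs per-sample work in parallel and prefetches batches while the GPU is busy training. The preprocess function below is a hypothetical stand-in for whatever the Sequence's __getitem__ currently does.

import tensorflow as tf

def preprocess(image, label):
    # Stand-in for the real per-sample CPU work done by the generator.
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, label

def make_dataset(images, labels, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    ds = ds.map(preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.batch(batch_size)
    # prefetch overlaps CPU preprocessing of the next batches with GPU training.
    return ds.prefetch(tf.data.experimental.AUTOTUNE)

# model.fit(make_dataset(x_train, y_train), epochs=10)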
Even with
inter_op_parallelism_threads = 1
intra_op_parallelism_threads = 1
values set, the TensorFlow 1.5 process is not single-threaded. Why? Is there a way to completely disable this unexpected thread spawning?
First of all, TensorFlow is a multi-level software stack, and each layer tries to be smart and introduces some worker threads of its own:
One thread is created by the Python runtime
Two more threads are created by the NVIDIA CUDA runtime
Next, there are threads originating from the way TensorFlow administers its internal compute jobs:
Threads are created/joined all the time to poll on job completion (GRPC engine)
Thus, TensorFlow cannot be single-threaded, even with all options set to 1. Perhaps this design is intended to reduce latency for async jobs. Yet there is a certain drawback: multicore compute libraries, such as linear-algebra packages, perform cache-intensive operations best with a static, symmetric core-to-thread mapping. And the dangling callback threads produced by TensorFlow will disturb this symmetry all the time.
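For reference, a minimal TF 1.x-style sketch (assumed, not the original poster's code) of the configuration in question, together with one way to count the threads that still appear:

import os
import threading
import tensorflow as tf

# Both knobs set to 1, as in the question.
config = tf.ConfigProto(
    inter_op_parallelism_threads=1,
    intra_op_parallelism_threads=1,
)

with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    sess.run(tf.matmul(a, a))
    # Python-level threads only:
    print("Python threads:", threading.active_count())
    # OS-level threads (Linux), which also include CUDA/GRPC worker threads:
    print("OS threads:", len(os.listdir(f"/proc/{os.getpid()}/task")))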