TensorRT multiple Threads - multithreading

I am trying to use TensorRT through the Python API in multiple threads, with the CUDA context shared across all of them (everything works fine in a single thread). I am using Docker with the tensorrt:20.06-py3 image, an ONNX model, and an Nvidia 1070 GPU.
The multi-threaded approach should be allowed, as mentioned in the TensorRT Best Practices guide.
I created the context in the main thread:
cuda.init()
device = cuda.Device(0)
ctx = device.make_context()
I tried two approaches. First, building the engine in the main thread and using it in the execution thread, which gives this error:
[TensorRT] ERROR: ../rtSafe/cuda/caskConvolutionRunner.cpp (373) - Cask Error in checkCaskExecError<false>: 10 (Cask Convolution execution)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception
Second, building the engine inside the thread, which gives this error:
pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context?
The error appears when I call cuda.Stream().
I know I can run multiple CUDA streams in parallel under the same CUDA context, but I don't know how to do it.

I found a solution. The idea is to create the usual global context with ctx = device.make_context() in the main thread, and then in each execution thread wrap the inference code like this:
ctx.push()
# ... execute inference code ...
ctx.pop()
The link for the source and full sample is here
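For illustration, here is a minimal sketch of that pattern. Everything beyond the push/pop idea is an assumption rather than part of the original post: the model.engine file name, a TensorRT 7.x Python API with execute_async_v2(), float32 bindings with the input at index 0 and the output at index 1, and the placeholder input shape.

import threading
import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

cuda.init()
device = cuda.Device(0)
ctx = device.make_context()  # shared global context, created once in the main thread

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
with open("model.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

def worker(input_array, results):
    ctx.push()  # make the shared context current in this thread
    try:
        stream = cuda.Stream()  # cuda.Stream() now has an active context
        exec_ctx = engine.create_execution_context()
        output = np.empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
        d_input = cuda.mem_alloc(input_array.nbytes)
        d_output = cuda.mem_alloc(output.nbytes)
        cuda.memcpy_htod_async(d_input, input_array, stream)
        exec_ctx.execute_async_v2([int(d_input), int(d_output)], stream.handle)
        cuda.memcpy_dtoh_async(output, d_output, stream)
        stream.synchronize()
        results.append(output)
    finally:
        ctx.pop()  # release the context before the thread exits

results = []
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)  # placeholder input shape
threads = [threading.Thread(target=worker, args=(dummy, results)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
ctx.pop()  # the main thread pops the context it created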

Related

Python Multiprocessing GPU receiving SIGSEGV message

I am looking to run 2 processes at the same time. The processes use AI models; one of them is almost 1 GB. From what I have researched, it seems the best way is to use multiprocessing. This is a Linux server with an 8-core CPU and one GPU. Because of the model size, I need the GPU to process these files. archivo_diar is the path to the file and modelo is loaded beforehand. The code goes like this:
from multiprocessing import Process

def diariza(archivo_diar, pipeline):
    dz = pipeline(archivo_diar, pipeline)

def transcribe_archivo(archivo_modelo, modelo):
    resultado = modelo.transcribe(archivo_diar)
    print(resultado)

p1 = Process(target=transcribe_archivo, args=(archivos_diar, modelo))
p1.start()
After p1.start() is run, I get the following message:
SIGSEGV received at time = 16766367473 on cpu 7
PC: # 0x7fb2c29705 144 GOMP_parallel
What I have found so far is that it is a problem related to memory, but I have not seen any case related to multiprocessing. As I understand it, the child process inherits memory from the main process, and modelo (the heavy file) is already loaded in memory, so that should not be the issue.
As you can see, the 2 processes (in the functions) are different; what I read is that in those cases the best approach is to use Pool. I also tried using a Pool like this:
pool = Pool(processes=4)
pool.imap_unordered(transcribe_archivo, [archivo_diar, modelo])
And I got the following error.
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use 'spawn' start method.
I tried using
multiprocessing.set_start_method('spawn')
and when I do pool.join() it hangs forever.
Does anyone know the reason for this?
Thanks.
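For reference, here is a minimal sketch of the 'spawn'-based pattern the CUDA error message points to, with the model loaded inside each worker instead of being inherited from the parent. load_model() and the file paths are hypothetical placeholders, and this is only an illustration of the pattern, not a confirmed fix for the hang.

from multiprocessing import get_context

def init_worker():
    # Runs once per worker started with 'spawn': load the model here instead of
    # inheriting CUDA state from a forked parent.
    global modelo
    modelo = load_model()  # hypothetical loader for the ~1 GB model

def transcribe_archivo(archivo_diar):
    resultado = modelo.transcribe(archivo_diar)
    return resultado

if __name__ == "__main__":
    archivos = ["audio1.wav", "audio2.wav"]  # placeholder paths
    ctx = get_context("spawn")  # the start method the error message asks for
    with ctx.Pool(processes=2, initializer=init_worker) as pool:
        for resultado in pool.imap_unordered(transcribe_archivo, archivos):
            print(resultado)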

How to manage CUDA streams and TensorRT contexts in a multithreaded GPU application?

For a TensorRT .trt file, we load it into an engine and create a TensorRT context for the engine. Then we use a CUDA stream to run inference by calling context->enqueueV2().
Do we need to call cudaCreateStream() after the TensorRT context is created? Or only after selecting the GPU device by calling SetDevice()? How does TensorRT associate the CUDA stream with the TensorRT context?
Can we use multiple streams with one TensorRT context?
In a multithreaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. So, in one thread, do we need just 1 engine, 1 context and 1 stream, or multiple streams?
Do we need to call cudaCreateStream() after the TensorRT context is created?
By cudaCreateStream() do you mean cudaStreamCreate()?
You can create them after you've created your engine and runtime.
As a bonus piece of trivia, you don't necessarily have to use CUDA streams at all. I have tried copying my data from host to device, calling enqueueV2() and then copying the result from device to host without using a CUDA stream. It worked fine.
How does TensorRT associate the CUDA stream with the TensorRT context?
The association is that you can pass the same CUDA stream as an argument to all of the function calls. The following C++ code illustrates this:
// Illustration only: executionContext is an nvinfer1::IExecutionContext created elsewhere,
// deviceMemory holds one device pointer per engine binding (input at index 0, output at index 1),
// and error handling is abbreviated.
void infer(std::vector<void*>& deviceMemory, void* hostInputMemory, size_t hostInputMemorySizeBytes,
           void* hostOutputMemory, size_t hostOutputMemorySizeBytes, cudaStream_t& cudaStream)
{
    // copy the input to the device on the stream
    if (cudaMemcpyAsync(deviceMemory.at(0), hostInputMemory, hostInputMemorySizeBytes,
                        cudaMemcpyHostToDevice, cudaStream) != cudaSuccess) { /* ... handle errors ... */ }

    // enqueue inference on the same stream
    if (!executionContext.enqueueV2(deviceMemory.data(), cudaStream, nullptr)) { /* ... handle errors ... */ }

    // copy the output back to the host on the same stream
    if (cudaMemcpyAsync(hostOutputMemory, deviceMemory.at(1), hostOutputMemorySizeBytes,
                        cudaMemcpyDeviceToHost, cudaStream) != cudaSuccess) { /* ... handle errors ... */ }

    // wait for all work queued on the stream to finish
    cudaStreamSynchronize(cudaStream);
}
You can check this repository if you want a full working example in C++. My code above is just an illustration.
Can we use multiple streams with one TensorRT context?
If I understood your question correctly, according to this document the answer is no.
In a multithreaded C++ application, each thread uses one model for inference, and one model might be loaded in more than one thread. So, in one thread, do we need just 1 engine, 1 context and 1 stream, or multiple streams?
one model might be loaded in more than one thread
this doesn't sound right.
An engine (nvinfer1::ICudaEngine) is created from a TensorRT engine file. The engine creates an execution context that is used for inference.
This part of the TensorRT developer guide states which operations are thread safe. The rest can be considered not thread safe.

Vulkan threaded application gets error message on queue submissions under mutex

I have an application with Vulkan for rendering and GLFW for windowing. If I start several threads, each with a different window, I get errors on threading and queue submission even though ALL Vulkan calls are protected by a common mutex. The Vulkan validation layer says:
THREADING ERROR : object of type VkQueue is simultaneously used in thread 0x0 and thread 0x7fc365b99700
Here is the skeleton of the loop under which this happens in each thread:
while (!finished) {
    window.draw(...);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
The draw function skeleton looks like:
draw(Arg arg) {
    static std::mutex mtx;
    std::lock_guard lock{mtx};
    // .... drawing calls. Including
    device.acquireNextImageKHR(...);
    // Fill command buffers
    graphicsQueue.submit(...);
    presentQueue.presentKHR(presentInfo);
}
This is C++17 which slightly simplifies the syntax but is otherwise irrelevant.
Clearly everything is under a mutex. I also intercept the debug message callback. When I do so, I see that one thread is waiting for GLFW events, one is printing the Vulkan layer message, and the other two threads are trying to acquire the mutex for the lock_guard.
I am at a loss as to what is going on or how to even figure out what is causing this.
I am running on Linux, and it does not crash there. However, on macOS, after a random amount of time, the code crashes in a queue submit call inside MoltenVK, and when the crash happens I see a similar situation among the threads; that is to say, no other thread is inside a Vulkan call.
I'd appreciate any ideas. My next move would be to move all queue submissions to a single thread, though that is not my favorite solution.
PS: I created a complete MCVE under the Vookoo framework. It is at https://github.com/FunMiles/Vookoo/tree/lock_guard_queues and is the example 00-parallelTriangles
To try it, do the following:
git clone https://github.com/FunMiles/Vookoo.git
cd Vookoo
git checkout lock_guard_queues
mkdir build
cd build
cmake ..
make
examples/00-parallelTriangles
The way you call draw is:
window.draw(device, fw.graphicsQueue(), [&](){//some lambda});
The inside of draw is protected by a mutex, but the fw.graphicsQueue() call isn't.
fw.graphicsQueue(), a million abstraction layers below, just calls vkGetDeviceQueue. I found that executing vkGetDeviceQueue in parallel with vkQueueSubmit causes the validation error.
So there are a few issues here:
There is a bug in the validation layers that causes multiple initialization of the VkQueue state on vkGetDeviceQueue, which is the cause of the validation error:
KhronosGroup/Vulkan-ValidationLayers#1751
The thread id of 0x0 is not a separate issue. As there are not any actual previous accesses, the thread id is not recorded. The layers issue the error because the access count goes negative, having previously been wrongly reset to 0.
Arguably there is also a spec issue here. It is not immediately obvious from the text that the VkQueue is not actually accessed in vkGetDeviceQueue, beyond the silent assumption that that is the sane thing to do:
KhronosGroup/Vulkan-Docs#1254

Segfault in multithreaded OpenGL?

I'm running into an issue where OpenGL calls in multiple threads sometimes cause a segfault, and I can't figure out what I'm doing wrong. I'm not sharing a context or anything else between threads.
invalid CoreGraphics connection
Segmentation fault: 11
The actual CGL result code is
kCGLBadConnection - Invalid connection to Core Graphics.
https://developer.apple.com/library/mac/documentation/graphicsimaging/reference/cgl_opengl/Reference/reference.html#//apple_ref/doc/uid/TP40001186-CH3g-BBCDCEBD
The end use case here is to render images asynchronously with libuv (doing some processing on the CPU then uploading data to the GPU for rendering), but I've worked up a simple test case which replicates this issue.
https://github.com/mikemorris/headless-gl-multithreaded
You need a valid OpenGL context bound to the thread when calling glReadPixels. The CGL variant of View::resize unbinds the OpenGL context at the end, so glReadPixels is called without an OpenGL context being active. I think this might be part of the reason for your problem.
It appears that the cause of the crash is multiple threads simultaneously trying to open a display connection in CGLChoosePixelFormat (or XOpenDisplay/glXChooseVisual in GLX). Opening a single connection in the main thread and then using this connection when instantiating new threads (each of which creates its own context) seems to fix this.

Is there some kind of incompatibility between Boost::thread() and Nvidia CUDA?

I'm developing a generic streaming CUDA kernel execution framework that allows parallel data copy and execution on the GPU.
Currently I'm calling the CUDA kernels within a C++ static function wrapper, so I can call the kernels from a .cpp file (not .cu), like this:
//kernels.cu:
//kernel definition
__global__ void kernelCall_kernel(dataRow* in, dataRow* out, void* additionalData){
    //Do something
};

//kernel handler, so I can compile this .cu and link it with the main project and call it within a .cpp file
extern "C" void kernelCall(dataRow* in, dataRow* out, void* additionalData){
    int blocksize = 256;
    dim3 dimBlock(blocksize);
    dim3 dimGrid(ceil(tableSize/(float)blocksize));
    kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the data printed is right.
//streamProcessing.cpp
//allocations and definitions of data omitted
//copy data to GPU
cudaMemcpy(data_d,data_h,tableSize,cudaMemcpyHostToDevice);
//call:
kernelCall(data_d,result_d,null);
//copy data back
cudaMemcpy(result_h,result_d,resultSize,cudaMemcpyDeviceToHost);
//show result:
printTable(result_h,resultSize);// this just iterate and shows the data
But to allow parallel copying and execution of data on the GPU, I need to create a thread, so I call it by creating a new boost::thread:
//allocations, definitions of data,copy data to GPU omitted
//call:
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d,result_d,null);
kernelThreadOwner->join();
//Copy data back and print omitted
I just get garbage when printing the result at the end.
Currently I'm just using one thread for testing purposes, so there should not be much difference between calling it directly and creating a thread. I have no clue why calling the function directly gives the right result but creating a thread does not. Is this a problem with CUDA and Boost? Am I missing something? Thanks in advance.
The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you use two threads, you have two contexts. The context that the main thread allocates and reads from and the context in which the other thread runs the kernel are not the same. Memory allocations are not portable between contexts; they are effectively separate memory spaces inside the same GPU.
If you want to use threads in this way, you either need to refactor things so that only one thread "talks" to the GPU and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free, and there is latency involved, so if you plan on migrating contexts frequently, you might find it more efficient to change to a different design which preserves context-thread affinity.
