Is there some kind of incompatibility with Boost::thread() and Nvidia CUDA? - multithreading

I'm developing a generic streaming CUDA kernel execution Framework that allows parallel data copy & execution on the GPU.
Currently I'm calling the cuda kernels within a C++ static function wrapper, so I can call the kernels from a .cpp file (not .cu), like this:
//kernels.cu:
//kernel definition
__global__ void kernelCall_kernel( dataRow* in, dataRow* out, void* additionalData){
//Do something
};
//kernel handler, so I can compile this .cu and link it with the main project and call it within a .cpp file
extern "C" void kernelCall( dataRow* in, dataRow* out, void* additionalData){
int blocksize = 256;
dim3 dimBlock(blocksize);
dim3 dimGrid(ceil(tableSize/(float)blocksize));
kernelCall_kernel<<<dimGrid,dimBlock>>>(in, out, additionalData);
}
If I call the handler as a normal function, the data printed is right.
//streamProcessing.cpp
//allocations and definitions of data omitted
//copy data to GPU
cudaMemcpy(data_d,data_h,tableSize,cudaMemcpyHostToDevice);
//call:
kernelCall(data_d,result_d,null);
//copy data back
cudaMemcpy(result_h,result_d,resultSize,cudaMemcpyDeviceToHost);
//show result:
printTable(result_h,resultSize);// this just iterate and shows the data
But to allow parallel copy and execution of data on the GPU I need to create a thread, so when I call it making a new boost::thread:
//allocations, definitions of data,copy data to GPU omitted
//call:
boost::thread* kernelThreadOwner = new boost::thread(kernelCall, data_d,result_d,null);
kernelThreadOwner->join();
//Copy data back and print ommited
I just get garbage when printing the result on the end.
Currently I'm just using one thread, for testing purpose, so there should be no much difference in calling it directly or creating a thread. I have no clue why calling the function directly gives the right result, and when creating a thread not. Is this a problem with CUDA & boost? Am I missing something? Thank you in advise.

The problem is that (pre CUDA 4.0) CUDA contexts are tied to the thread in which they were created. When you are using two threads, you have two contexts. The context that the main thread is allocating and reading from, and the context that the thread which runs the kernel inside are not the same. Memory allocations are not portable between contexts. They are effectively separate memory spaces inside the same GPU.
If you want to use threads in this way, you either need to refactor things so that one thread only "talks" to the GPU, and communicates with the parent via CPU memory, or use the CUDA context migration API, which allows a context to be moved from one thread to another (via cuCtxPushCurrent and cuCtxPopCurrent). Be aware that context migration isn't free, and there is latency involved, so if you plan to migrating contexts around frequently, you might find it more efficient to change to a different design which preserves context-thread affinity.

Related

How to manage the cuda streams and TensorRT context in multiple threads GPU application?

For a tensorrt trt file, we will load it to an engine, and create Tensorrt context for the engine. Then use cuda stream to inference by calling context->enqueueV2().
Do we need to call cudaCreateStream() after the Tensorrt context is created? Or just need to after selecting GPU device calling SetDevice()? How the TensorRT associate the cuda stream and Tensorrt context?
Can we use multiple streams with one Tensorrt context?
In a multiple thread C++ application, each thread uses one model to inference, one model might be loaded in more than 1 thread; So, in one thread, do we just need 1 engine, 1 context and 1 stream or multiple streams?
Do we need to call cudaCreateStream() after the Tensorrt context is created?
By cudaCreateStream() do you mean cudaStreamCreate()?
You can create them after you've created your engine and runtime.
As a bonus trivia, you don't necessarily have to use CUDA streams at all. I have tried copying my data from host to device, calling enqueueV2() and then copying the from device to host without using a CUDA stream. It worked fine.
How the TensorRT associate the cuda stream and Tensorrt context?
The association is that you can pass the same CUDA stream as an argument to all of the function calls. The following c++ code will illustrate this:
void infer(std::vector<void*>& deviceMemory, void* hostInputMemory, size_t hostInputMemorySizeBytes, cudaStream_t& cudaStream)
{
auto success = cudaMemcpyAsync(deviceMemory, hostInputMemory, hostInputMemorySizeBytes, cudaMemcpyHostToDevice, cudaStream)
if (not success) {... handle errors...}
if (not executionContext.enqueueV2(static_cast<void**>(deviceMemory.data()), cudaStream, nullptr)
{ ... handle errors...}
void* outputHostMemory; // allocate size for all bindings
size_t outputMemorySizeBytes;
auto success2 = cudaMemcpyAsync(&outputHostMemory, &deviceMemory.at(0), outputMemorySizeBytes, cudaMemcpyDeviceToHost, cudaStream);
if (not success2) {... error handling ...}
cudaStream.waitForCompletion();
}
You can check this repository if you want a full working example in c++. My code above is just an illustration.
Can we use multiple streams with one Tensorrt context?
If I understood your question correctly, according to this document the answer is no.
In a multiple thread C++ application, each thread uses one model to inference, one model might be loaded in more than 1 thread; So, in one thread, do we just need 1 engine, 1 context and 1 stream or multiple streams?
one model might be loaded in more than 1 thread
this doesn't sound right.
An engine (nvinfer1::ICudaEngine) is created from a TensorRT engine file. The engine creates an execution context that is used for inference.
This part of TensorRT developer guide states which operations are thread safe. The rest can be considered non-thread safe.

Vulkan theaded application get error message on queue submissions under mutex

I have an application with Vulkan for rendering and glfw for windowing. If I start several threads, each with a different window, I get errors on threading and queue submission even though ALL vulkan calls are protected by a common mutex. The vulkan layer says:
THREADING ERROR : object of type VkQueue is simultaneously used in thread 0x0 and thread 0x7fc365b99700
Here is the skeleton of the loop under which this happens in each thread:
while (!finished) {
window.draw(...);
std::this_thread::sleep_for(std::chrono::milliseconds(10));
}
The draw function skeleton looks like:
draw(Arg arg) {
static std::mutex mtx;
std::lock_guard lock{mtx};
// .... drawing calls. Including
device.acquireNextImageKHR(...);
// Fill command bufers
graphicsQueue.submit(...);
presentQueue.presentKHR(presentInfo);
}
This is C++17 which slightly simplifies the syntax but is otherwise irrelevant.
Clearly everything is under a mutex. I also intercept the call to the debug message. When I do so, I see that one thread is waiting for glfw events, one is printing the vulkan layer message and the other two threads are trying to acquire the mutex for the lock_guard.
I am at a loss as to what is going on or how to even figure out what is causing this.
I am running on linux, and it does not crash. However on Mac OS X, after a random amount of time, the code will crash in a queue submit call of MoltenVK and when the crash happens, I see a similar situation of the threads. That is to say no other thread is inside a Vulkan call.
I'd appreciate any ideas. My next move would be to move all queue submissions to a single thread, though that is not my favorite solution.
PS: I created a complete MCVE under the Vookoo framework. It is at https://github.com/FunMiles/Vookoo/tree/lock_guard_queues and is the example 00-parallelTriangles
To try it, do the following:
git clone https://github.com/FunMiles/Vookoo.git
cd Vookoo
git checkout lock_guard_queues
mkdir build
cd build
cmake ..
make
examples/00-parallelTriangles
The way you call the draw is:
window.draw(device, fw.graphicsQueue(), [&](){//some lambda});
The insides of draw is protected by mutex, but the fw.graphicsQueue() isn't.
fw.graphicsQueue() million abstraction layers below just calls vkGetDeviceQueue. I found executing vkGetDeviceQueue in parallel with vkQueueSubmit causes the validation error.
So there are few issues here:
There is a bug in layers that causes multiple initialization of VkQueue state on vkGetDeviceQueue, which is the cause of the validation error
KhronosGroup/Vulkan-ValidationLayers#1751
Thread id 0 is not a separate issue. As there are not any actual previous accesses, thread id is not recorded. The problem is the layers issue the error because the access count goes into negative because it is previously wrongly reset to 0.
Arguably there is some spec issue here. It is not immediatelly obvious from the text that VkQueue is not actually accessed in vkGetDeviceQueue, except the silent assumption that it is the sane thing to do.
KhronosGroup/Vulkan-Docs#1254

What is the proper way to get continuously processed data from a thread?

The following functions and fields are part of the same class in a Visual Studio DLL. Data is continuously being read and processed using the run function on a thread. However, getPoints is being accessed in a Qt app on a QTimer. I don't wan't to miss a single processed vector, because it seems it could be skipping leading to jumpy data. What's the safest way to get the points to the updated version?
If possible I'd like an answer that uses the C++ standard library as I've been exploring mutex-es, but it still seems to lead to jumpy data.
vector<float> points;
// std::mutex ioMutex;
// function running on a thread
void run(){
while(running){
//ioMutex.lock()
vector<byte> data = ReadData()
points = processData(data);
//ioMutex.unlock()
}
}
vector<float> getPoints(){
return points;
}
I believe there is a mistake in your code. The while loop will consume all the process activity and will not allow proper functionality of other functions. In Qt, in such continuous loops, usually it is a good habit to use the following because it actually gives other process time to access the event buffer properly. If this dll is written in Qt, please add the following within the while loop
QCoreApplication::processEvents();
The safest (and probably easiest) way to deliver your points-data to the main thread is by calling qApp->postEvent() with an object of a custom QEvent-subclass that contains your vector<float> as a member-variable.
That will cause the event(QEvent *) method of (whatever Qt object you specified as the first argument to postEvent()) to be called from inside the main/GUI thread, and so you can override that method to read the vector<float> out of the QEvent-subclassed object and update the GUI with that data.

How do two or more threads share memory on the heap that they have allocated?

As the title says, how do two or more threads share memory on the heap that they have allocated? I've been thinking about it and I can't figure out how they can do it. Here is my understanding of the process, presumably I am wrong somewhere.
Any thread can add or remove a given number of bytes on the heap by making a system call which returns a pointer to this data, presumably by writing to a register which the thread can then copy to the stack.
So two threads A and B can allocate as much memory as they want. But I don't see how thread A could know where the memory that thread B has allocated is located. Nor do I know how either thread could know where the other thread's stack is located. Multi-threaded programs share the heap and, I believe, can access one another's stack but I can't figure out how.
I tried searching for this question but only found language specific versions that abstract away the details.
Edit:
I am trying not to be language or OS specific but I am using Linux and am looking at it from a low level perspective, assembly I guess.
My interpretation of your question: How can thread A get to know a pointer to the memory B is using? How can they exchange data?
Answer: They usually start with a common pointer to a common memory area. That allows them to exchange other data including pointers to other data with each other.
Example:
Main thread allocates some shared memory and stores its location in p
Main thread starts two worker threads, passing the pointer p to them
The workers can now use p and work on the data pointed to by p
And in a real language (C#) it looks like this:
//start function ThreadProc and pass someData to it
new Thread(ThreadProc).Start(someData)
Threads usually do not access each others stack. Everything starts from one pointer passed to the thread procedure.
Creating a thread is an OS function. It works like this:
The application calls the OS using the standard ABI/API
The OS allocates stack memory and internal data structures
The OS "forges" the first stack frame: It sets the instruction pointer to ThreadProc and "pushes" someData onto the stack. I say "forge" because this first stack frame does not arise naturally but is created by the OS artificially.
The OS schedules the thread. ThreadProc does not know it has been setup on a fresh stack. All it knows is that someData is at the usual stack position where it would expect it.
And that is how someData arrives in ThreadProc. This is the way the first, initial data item is shared. Steps 1-3 are executed synchronously by the parent thread. 4 happens on the child thread.
A really short answer from a bird's view (1000 miles above):
Threads are execution paths of the same process, and the heap actually belongs to the process (and as a result shared by the threads). Each threads just needs its own stack to function as a separate unit of work.
Threads can share memory on a heap if they both use the same heap. By default most languages/frameworks have a single default heap that code can use to allocate memory from the heap. In unmanaged languages you generally make explicit calls to allocate heap memory. In C, that might be malloc, etc. for example. In managed languages heap allocation is usually automatic and how allocation is done depends on the language--usually through the use of the new operator. but, that depends slightly on context. If you provide the OS or language context you're asking about, I might be able to provide more detail.
A Thread shared with other threads belonging to the same process: its code section, data section and other operating system resources such as open files and signals.
The part you are missing is static memory containing static variables.
This memory is allocated when the program is started, and assigned known adresses (determined at the linking time). All threads can access this memory without exchanging any data runtime, because the addresses are effectively hardcoded.
A simple example might look like this:
// Global variable.
std::atomic<int> common_var;
void thread1() {
common_var = compute_some_value();
}
void thread2() {
do_something();
int current_value = common_var;
do_more();
}
And of course the global value may be a pointer, that can be used to exchange heap memory. The producer allocates some objects, the consumer takes and uses them.
// Global variable.
std::atomic<bool> produced;
SomeData* data_pointer;
void producer_thread() {
while (true) {
if (!produced) {
SomeData* new_data = new SomeData();
data_pointer = new_data;
// Let the other thread know there is something to read.
produced = true;
}
}
}
void consumer_thread() {
while (true) {
if (produced) {
SomeData* my_data = data_pointer;
data_pointer = nullptr;
// Let the other thread know we took the data.
produced = false;
do_something_with(my_data);
delete my_data;
}
}
}
Please note: these are not examples of good concurrent code, but they show the general idea without too much clutter.

Thread Communication

Is there any tool available for tracing communication among threads;
1. running in a single process
2. running in different processes (IPC)
I am presuming you need to trace this for debugging. Under normal circumstances it's hard to do this, without custom written code. For a similar problem that I faced, I had a per-processor tracing buffer, which used to record briefly the time and interesting operation that was performed by the running thread. The log was a circular trace which used to store data like this:
struct trace_data {
int op;
void *data;
struct time t;
union {
struct {
int op1_field1;
int op1_field2;
} d1;
struct {
int op2_field1;
int op2_field2;
} d2
} u;
}
The trace log was an array of these structures of length 1024, one for each processor. Each thread used to trace operations, as well as time to determine causality of events. The fields which were used to store data in the "union" depended upon the operation being done. The "data" pointer's meaning depended upon the "op" as well. When the program used to crash, I'd open the core in gdb and I had a gdb script which would go through the logs in each processor and print out the ops and their corresponding data, to find out the history of events.
For different processes you could do such logging to a file instead - one per process. This example is in C, but you can do this in whatever language you want to use, as long as you can figure out the CPU id on which the thread is running currently.
You might be looking for something like the Intel Thread Checker as long as you're using pthreads in (1).
For communication between different processes (2), you can use Aspect-Oriented Programming (AOP) if you have the source code, or write your own wrapper for the IPC functions and LD_PRELOAD it.
Edit: Whoops, you said tracing, not checking.
It will depend so much on the operating system and development environment that you are using. If you're with Visual Studio, look at the tools in Visual Studio 2010.

Resources