Multithreading in DirectX 12

I am having a hard time grasping the concept of multithreaded rendering in DX12.
According to MSDN, one must write draw commands into direct command lists (preferably using bundles) and then submit those lists to a command queue.
It is also said that one can have more than one command queue for direct command lists, but it is unclear to me what the purpose of doing so is.
I get the full benefit of multithreading by building command lists in parallel threads, don't I? If so, why would I want more than one command queue associated with the device?
I suspect that improper management of command queues can lead to serious performance problems in later stages of rendering library development.

The main benefit of DirectX 12 is that execution of commands is almost purely asynchronous: when you call ID3D12CommandQueue::ExecuteCommandLists, it kicks off the work of the command lists passed in and returns without waiting for the GPU to finish. This brings up another point, however. A common misconception is that rendering itself is somehow multithreaded now, and this is simply not true. All work is still executed on the GPU. What is done on several threads is command list recording: you create an ID3D12GraphicsCommandList object for each thread that needs one.
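An example:
DrawObject DrawObjects[10];
ID3D12CommandQueue* GCommandQueue = ...
// One command list per recording thread, visible to ExecuteCommands() below
ID3D12GraphicsCommandList* clForThread1 = ...
ID3D12GraphicsCommandList* clForThread2 = ...
void RenderThread1()
{
    for (int i = 0; i < 5; i++)
        clForThread1->RecordDraw(DrawObjects[i]); // RecordDraw is a placeholder, not a real D3D12 method
    clForThread1->Close(); // a list must be closed before it is submitted
}
void RenderThread2()
{
    for (int i = 5; i < 10; i++)
        clForThread2->RecordDraw(DrawObjects[i]);
    clForThread2->Close();
}
void ExecuteCommands()
{
    // Call this only after both recording threads have finished
    ID3D12CommandList* cl[2] = { clForThread1, clForThread2 };
    GCommandQueue->ExecuteCommandLists(2, cl);
    GCommandQueue->Signal(...);
}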
This example is a very rough use case, but that is the general idea: you can record the objects of your scene on different threads to spread the CPU cost of recording the commands.
Another useful thing, however, is that with this setup you can kick off one rendering task and immediately start recording another.
An example:
void Render()
{
    ID3D12GraphicsCommandList* cl = ...
    cl->DrawObjectsInTheScene(...); // placeholder for recording the scene's draw calls
    cl->Close();
    ID3D12CommandList* lists[] = { cl };
    CommandQueue->ExecuteCommandLists(1, lists); // Just send it to the GPU to start rendering all the objects in the scene
    // Since the GPU work for the scene has started, we can record the post-processing pass while the scene is being rendered on the GPU
    ID3D12GraphicsCommandList* cl2 = ...
    cl2->SetBloomPipelineState(...); // placeholder
    cl2->SetResources(...);          // placeholder
    cl2->DrawOnScreenQuad();         // placeholder
}
The advantage here over DirectX 11 or OpenGL is that those APIs may just sit there recording commands and possibly not send them to the GPU until Present() is called, which forces the CPU to wait and incurs overhead.
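As a rough, self-contained sketch of the first example above (record on several threads, submit once), the following C++ uses stand-in types instead of the real D3D12 interfaces, so it runs anywhere. FakeCommandList, FakeCommandQueue, RecordRange and RenderFrame are hypothetical names invented for illustration; they are not part of DirectX 12, and a real renderer would also need allocators, fences and resource barriers.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a command list: it only remembers what was recorded.
struct FakeCommandList {
    std::vector<std::string> commands;
    bool closed = false;
};

// Hypothetical stand-in for a command queue: "executes" lists in submission order.
struct FakeCommandQueue {
    std::vector<std::string> executed;
    void ExecuteCommandLists(const std::vector<FakeCommandList*>& lists) {
        for (const FakeCommandList* list : lists) {
            assert(list->closed && "a list must be closed before submission");
            executed.insert(executed.end(), list->commands.begin(), list->commands.end());
        }
    }
};

// Each worker records a disjoint slice of the objects into its own list,
// so no locking is needed while recording.
inline void RecordRange(FakeCommandList& list, int first, int last) {
    for (int i = first; i < last; ++i)
        list.commands.push_back("draw object " + std::to_string(i));
    list.closed = true;
}

// Records on `threadCount` threads, then submits every list with one call.
inline std::vector<std::string> RenderFrame(int objectCount, int threadCount) {
    assert(threadCount > 0);
    std::vector<FakeCommandList> lists(threadCount);
    std::vector<std::thread> workers;
    const int perThread = (objectCount + threadCount - 1) / threadCount;
    for (int t = 0; t < threadCount; ++t) {
        const int first = std::min(objectCount, t * perThread);
        const int last = std::min(objectCount, first + perThread);
        workers.emplace_back(RecordRange, std::ref(lists[t]), first, last);
    }
    for (std::thread& w : workers) w.join();  // recording must finish before submission

    FakeCommandQueue queue;
    std::vector<FakeCommandList*> submission;
    for (FakeCommandList& list : lists) submission.push_back(&list);
    queue.ExecuteCommandLists(submission);
    return queue.executed;
}
```

Calling RenderFrame(10, 2) mirrors the two-thread example: the recording order between threads is arbitrary, but the submission order is fixed by the order of the lists in the array.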

Related

How to compile OpenCL-programs on multiple cores?

OpenCL programs/kernels get built/compiled at runtime using the clBuildProgram() function. My program dynamically creates kernels to build and as such spends a considerable amount of time compiling them. Of course, seeing that there are many kernels and they are completely independent of each other, I would like to split this work over multiple cores.
This person seems to have a very similar problem, but this was 6 years ago and the solution is not really satisfactory in my opinion. My attempt is shown in the snippet below:
ThreadPool tempPool;
auto start = std::chrono::steady_clock::now();
for (int reps = 0; reps < 50; reps++) {
    tempPool.addJob([this] () {
        auto jobStart = std::chrono::steady_clock::now();
        // These would hold the program sources
        std::vector<const char*> sources = {sourceCode.toRawUTF8()};
        // Element type must be non-const: std::vector<const size_t> does not compile
        std::vector<size_t> sourceLengths = {sourceCode.getNumBytesAsUTF8()};
        cl_int ret;
        cl_program program = clCreateProgramWithSource(getCLContext()(), 1, sources.data(), sourceLengths.data(), &ret);
        // Build the program
        ret = clBuildProgram(program, 1, &getCLDevices()[0](), NULL, NULL, NULL);
        if (ret) {
            // Generic error checking
        }
        auto singleDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - jobStart).count();
    });
}
// Simple way to wait for all jobs to be finished
while (tempPool.getNumJobs() > 0) {
    Thread::sleep(1);
}
auto totalDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
Everything else I do using this ThreadPool setup results in a speedup of 5-6x (I have 8 threads), which is to be expected. However, building OpenCL kernels does not. It seems as if only one kernel can be built at a time.
Is there a solution to this? I'm on macOS at the moment, but I would also be interested in Linux/Windows.
If not, is there a way to build OpenCL kernels that does not involve clBuildProgram(), for example with gcc or a similar tool?
(I am surprised that the driver for your platform isn't already multithreaded. Are you sure your calls are really running in parallel?)
If you're still stuck, here is a wretched hack that might work; it extends the solution in your referenced question. For some drivers clCreateProgramWithBinary is much faster than building from source. Hence:
1. Fork new processes (or call a helper executable that uses the same device set).
2. Each subprocess calls clCreateProgramWithSource and then clBuildProgram.
3. The children call clGetProgramInfo(...CL_PROGRAM_BINARIES...) to fetch the binary and then pass it back via a file, a pipe, or some other interprocess communication.
4. The parent loads each returned binary with clCreateProgramWithBinary (a clBuildProgram call is still required afterwards, but it should have far less work to do).
Again, I'd check your setup code first before duct-taping this hack together.
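One way to check whether the calls really run in parallel is to time the same job sequentially and then on several threads, and compare. The sketch below does that with plain std::thread; ElapsedMs and MeasureSpeedup are hypothetical helper names, and the job passed in is assumed to be safe to call from several threads at once. With an OpenCL build as the job, a result near 1.0 would suggest the driver serializes builds. If the driver caches compiled programs, each invocation would need a distinct source for the comparison to be fair.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Returns how long `work` took, in milliseconds.
inline double ElapsedMs(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

// Runs `job` jobCount times back to back, then jobCount times on separate
// threads, and returns sequential time divided by parallel time.
inline double MeasureSpeedup(int jobCount, const std::function<void()>& job) {
    assert(jobCount > 0);
    const double sequentialMs = ElapsedMs([&] {
        for (int i = 0; i < jobCount; ++i) job();
    });
    const double parallelMs = ElapsedMs([&] {
        std::vector<std::thread> threads;
        for (int i = 0; i < jobCount; ++i) threads.emplace_back(job);
        for (std::thread& t : threads) t.join();
    });
    return sequentialMs / parallelMs;
}
```

A job that merely sleeps should show a speedup close to jobCount, which makes it a useful sanity check of the harness before plugging in the real build call.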

Process Vs Thread : Looking for best explanation with example c#

Apologies for posting this question here; I have read a few similar threads, but things are still not clear.
As we know, both processes and threads are independent sequences of execution. The typical difference is that threads (of the same process) run in a shared memory space, while processes run in separate memory spaces. (quoted from this answer)
The above explanation is not enough for me to visualize what actually happens. It would be better if someone explained what a process is with an example, and how it differs from a thread, also with an example.
Suppose I start MS Paint or an accounting program. Can we say that the accounting program is a process? I guess not. An accounting app may have multiple processes, and each process can start multiple threads.
I want to visualize which part can be called a process when we run an application. So please explain with examples for better visualization, and also explain how a process and a thread are not the same. Thanks.
Suppose I start MS Paint or an accounting program. Can we say that the accounting program is a process?
Yes. Or rather, the currently running instance of it is.
I guess not. An accounting app may have multiple processes, and each process can start multiple threads.
It is possible for a process to start another process, but that is relatively unusual with windowed software.
A process is a running instance of a given executable; a windowed application, a console application and a background application would each involve a running process.
Process refers to the space in which the application runs. With a simple program like Notepad, if you open it twice so that you have two Notepad windows open, you have two Notepad processes. (This is also true of more complicated programs, but note that some do their own work to keep things down to one. For example, if you have Firefox open and you run Firefox again, there will briefly be two Firefox processes, but the second one will tell the first to open a new window before exiting, and the number of processes returns to one. Having a single process makes communication within that application simpler, for reasons we'll get to now.)
Now each process will have to have at least one thread. This thread contains information about just what it is trying to do (typically in a stack, though that is certainly not the only possible approach). Consider this simple C# program:
static int DoAdd(int a, int b)
{
    return a + b;
}
void Main()
{
    int x = 2;
    int y = 3;
    int z = DoAdd(x, y);
    Console.WriteLine(z);
}
With this simple program, first 2 and 3 are stored in places in the stack (corresponding to the labels x and y). Then they are pushed onto the stack again and the thread moves to DoAdd. In DoAdd these are popped and added, and the result is pushed to the stack. Then this is stored in the stack (corresponding to the label z). Then that is pushed again and the thread moves to Console.WriteLine. That does its thing and the thread moves back to Main. Then it leaves Main and the thread dies. As it was the only foreground thread running, its death leads to the process also ending.
(I'm simplifying here, and I don't think there's a need to nitpick all of those simplifications right now; I'm just presenting a reasonable mental model).
There can be more than one thread. For example:
static int DoAdd(int a, int b)
{
    return a + b;
}
static void PrintTwoMore(object num)
{
    Thread.Sleep(new Random().Next(0, 500));
    Console.WriteLine(DoAdd(2, (int)num));
}
void Main()
{
    for(int i = 0; i != 10; ++i)
        new Thread(PrintTwoMore).Start(i);
}
Here the first thread creates ten more threads. Each of these pauses for a random length of time (just to demonstrate that they are independent) and then does a similar task to the first example's only thread.
The first thread dies upon creating the 10th new thread and setting it going. The last of these 10 threads to be running will be the last foreground thread and so when it dies so does the process.
Each of these threads can "see" the same methods and can "see" any data that is stored in the application, though there are limits on how safely they can do so without stamping over each other, which I won't get into now.
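To make the shared-memory point concrete, here is a minimal sketch in C++ rather than C# (the idea is the same in both): several threads update one variable that all of them can see. CountWithThreads is a hypothetical name, and std::atomic is used so that no increment is lost, which is one simple way to stop threads stamping over each other.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Starts `threadCount` threads that all increment the same counter and
// returns the final value once every thread has finished.
inline int CountWithThreads(int threadCount, int incrementsPerThread) {
    assert(threadCount >= 0 && incrementsPerThread >= 0);
    std::atomic<int> shared{0};  // one variable, visible to every thread
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&shared, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i)
                ++shared;  // atomic increment: no updates are lost
        });
    }
    for (std::thread& t : threads) t.join();
    return shared.load();
}
```

Two separate processes could not share the counter this directly; they would have to pass the value through a pipe, a file, or some other interprocess channel.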
A process can also start a new process and communicate with it. This is very common in command-line programs, but less so in windowed programs. In the case of windowed programs, it's also more common on *nix than on Windows.
One example of this would be when Geany does a find-in-directory operation. Geany doesn't have its own find-in-directory functionality but rather runs the program grep and then interprets the results. So we start with one process (Geany) with its own threads running; then one of those threads causes the grep program to run, which means we've also got a grep process running with its own threads. Geany's threads and grep's threads cannot communicate with each other as easily as threads in the same process can, but when grep outputs results, the thread in Geany can read that output and use it to display those results.

Threadpool queueuserworkitem with many threads

Heads up: I am not very familiar with working with the thread pool, which might be obvious from the following code. I am under the impression that I could push many values into this queue, and it would wait for one thread to complete before moving on to the next, with the system handling how many threads run at once.
I am trying to use ThreadPool::QueueUserWorkItem(waitcallback, num), where the value of num is iterated up to a dynamic value depending on some prior algorithm. The problem I am coming across is that the program crashes when it gets too high.
WaitCallback^ wcb = gcnew WaitCallback(this, &createImage);
for(int i = 0; i < numBlocks; i++)
{
    ThreadPool::QueueUserWorkItem(wcb, i);
}
I get the message "Runtime Error! This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information."
My most recent run-through had numBlocks = 644.
It's hard to say what caused the program to crash. Most likely, an exception was thrown in one of the threads, and that brought the program down. You'll have to determine where in your code the exception was thrown.
As you know, ThreadPool::QueueUserWorkItem queues an item to be processed by the threadpool. But there can be multiple threads processing items from that queue. For example, you could have 20 pool threads, with 15 of them processing the work items that you queued.
If you really have that many items to process and you want them done one at a time, why not just queue one work item that does them one at a time? I've never done managed C++, so I won't try to write an example with it. But perhaps you can translate this C# code:
void ProcessInBackground(object state)
{
    int numBlocks = (int)state;
    for (int i = 0; i < numBlocks; ++i)
    {
        createImage(i);
    }
}
And then you can call it with:
ThreadPool.QueueUserWorkItem(ProcessInBackground, numBlocks);
That queues a single work item, so one pool thread will process the items in order.
I suspect you can convert that to managed C++ fairly easily.
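For comparison, the same "one background worker, items in order" idea can be sketched in standard C++ (not C++/CLI) with std::async. This ProcessInBackground takes the per-block function as a parameter, which is an assumption made so the sketch is self-contained; the int return value per block is likewise only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Runs createImage(0), createImage(1), ... on a single background thread and
// hands the results back, in order, through a future.
inline std::future<std::vector<int>> ProcessInBackground(
        int numBlocks, std::function<int(int)> createImage) {
    assert(numBlocks >= 0);
    return std::async(std::launch::async,
                      [numBlocks, createImage = std::move(createImage)] {
        std::vector<int> results;
        results.reserve(static_cast<std::size_t>(numBlocks));
        for (int i = 0; i < numBlocks; ++i)
            results.push_back(createImage(i));  // strictly one block at a time
        return results;
    });
}
```

The caller keeps running after the call returns and only blocks when it asks the future for the results with get(). An exception thrown by createImage is rethrown from get() instead of silently taking the process down, which also helps locate the kind of crash described in the question.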

QThread execution freezes my GUI

I'm new to multithreaded programming. I wrote this simple multithreaded program with Qt. But when I run it, it freezes my GUI, and when I click inside my window, the system reports that the program is not responding.
Here is my widget class. My thread counts up an integer and emits it whenever the number is divisible by 1000. In my widget I simply catch this number with the signal-slot mechanism and show it in a label and a progress bar.
Widget::Widget(QWidget *parent) :
    QWidget(parent),
    ui(new Ui::Widget)
{
    ui->setupUi(this);
    MyThread *th = new MyThread;
    connect( th, SIGNAL(num(int)), this, SLOT(setNum(int)));
    th->start();
}
void Widget::setNum(int n)
{
    ui->label->setNum( n);
    ui->progressBar->setValue(n%101);
}
and here is my thread run() function :
void MyThread::run()
{
    for( int i = 0; i < 10000000; i++){
        if( i % 1000 == 0)
            emit num(i);
    }
}
thanks!
The problem is with your thread code producing an event storm. The loop counts very fast -- so fast that the fact that you emit a signal only every 1000 iterations is pretty much immaterial. On modern CPUs, doing 1000 integer divisions takes on the order of 10 microseconds IIRC. If the loop were the only limiting factor, you'd be emitting signals at a peak rate of about 100,000 per second. This is not the case because the performance is limited by other factors, which we shall discuss below.
Let's understand what happens when you emit signals in a different thread from where the receiver QObject lives. The signals are packaged in a QMetaCallEvent and posted to the event queue of the receiving thread. An event loop running in the receiving thread -- here, the GUI thread -- acts on those events using an instance of QAbstractEventDispatcher. Each QMetaCallEvent results in a call to the connected slot.
The access to the event queue of the receiving GUI thread is serialized by a QMutex. On Qt 4.8 and newer, the QMutex implementation got a nice speedup, so the fact that each signal emission results in locking of the queue mutex is not likely to be a problem. Alas, the events need to be allocated on the heap in the worker thread, and then deallocated in the GUI thread. Many heap allocators perform quite poorly when this happens in quick succession if the threads happen to execute on different cores.
The biggest problem comes in the GUI thread. There seems to be a bunch of hidden O(n^2) complexity algorithms! The event loop has to process 10,000 events. Those events will most likely be delivered very quickly and end up in a contiguous block in the event queue. The event loop will have to deal with all of them before it can process further events. A lot of expensive operations happen when you invoke your slot. Not only is the QMetaCallEvent deallocated from the heap, but the label schedules an update() (repaint), and this internally posts a compressible event to the event queue. Compressible event posting has to, in the worst case, iterate over the entire event queue. That's one potential O(n^2) complexity action. Another such action, probably more important in practice, is the progress bar's setValue internally calling QApplication::processEvents(). This can recursively call your slot to deliver the subsequent signal from the event queue. You're doing way more work than you think you are, and this locks up the GUI thread.
Instrument your slot and see if it's called recursively. A quick-and-dirty way of doing it is
void Widget::setNum(int n)
{
    static int level = 0, maxLevel = 0;
    level ++;
    maxLevel = qMax(level, maxLevel);
    ui->label->setNum( n);
    ui->progressBar->setValue(n%101);
    if (level > 1 && level == maxLevel-1) {
        qDebug("setNum recursed up to level %d", maxLevel);
    }
    level --;
}
What is freezing your GUI thread is not QThread's execution, but the huge amount of work you make the GUI thread do, even if your code looks innocuous.
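One common way to avoid such an event storm, which is not something the question's code does, is to throttle emission by elapsed time instead of by iteration count. The sketch below is plain C++ with no Qt dependency; EmitThrottle and CountEmissions are hypothetical names, and in the worker's run() loop one would call ShouldEmit(std::chrono::steady_clock::now()) and emit the signal only when it returns true.

```cpp
#include <cassert>
#include <chrono>

// Decides whether enough time has passed since the last accepted emission.
class EmitThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit EmitThrottle(Clock::duration minInterval) : minInterval_(minInterval) {
        assert(minInterval >= Clock::duration::zero());
    }

    // Returns true (and remembers `now`) if an emission is allowed at `now`.
    bool ShouldEmit(Clock::time_point now) {
        if (hasEmitted_ && now - last_ < minInterval_) return false;
        last_ = now;
        hasEmitted_ = true;
        return true;
    }

private:
    Clock::duration minInterval_;
    Clock::time_point last_{};
    bool hasEmitted_ = false;
};

// Simulates a loop with a fake clock that advances `stepPerIteration` on each
// pass, and counts how many passes would actually emit.
inline int CountEmissions(int iterations,
                          EmitThrottle::Clock::duration stepPerIteration,
                          EmitThrottle::Clock::duration minInterval) {
    EmitThrottle throttle(minInterval);
    EmitThrottle::Clock::time_point now{};
    int emitted = 0;
    for (int i = 0; i < iterations; ++i) {
        if (throttle.ShouldEmit(now)) ++emitted;
        now += stepPerIteration;
    }
    return emitted;
}
```

With a minimum interval of a few tens of milliseconds, the GUI thread receives a bounded number of updates per second no matter how fast the loop spins.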
Side Note on processEvents and Run-to-Completion Code
I think it was a very bad idea to have QProgressBar::setValue invoke processEvents(). It only encourages the broken way people code things (continuously running code instead of short run-to-completion code). Since the processEvents() call can recurse into the caller, setValue becomes a persona-non-grata, and possibly quite dangerous.
If one wants to code in continuous style yet keep the run-to-completion semantics, there are ways of dealing with that in C++. One is just by leveraging the preprocessor, for example code see my other answer.
Another way is to use expression templates to get the C++ compiler to generate the code you want. You may want to leverage a template library here -- Boost spirit has a decent starting point of an implementation that can be reused even though you're not writing a parser.
The Windows Workflow Foundation also tackles the problem of how to write sequential style code yet have it run as short run-to-completion fragments. They resort to specifying the flow of control in XML. There's apparently no direct way of reusing standard C# syntax. They only provide it as a data structure, a-la JSON. It'd be simple enough to implement both XML and code-based WF in Qt, if one wanted to. All that in spite of .NET and C# providing ample support for programmatic generation of code...
The way you implemented your thread, it does not have its own event loop (because it does not call exec()). The code within run() is still executed in the new thread, not in the GUI thread; the missing event loop only matters if the thread needs to receive queued signals or events itself.
Usually you should not subclass QThread. You probably did so because you read the Qt Documentation which unfortunately still recommends subclassing QThread - even though the developers long ago wrote a blog entry stating that you should not subclass QThread. Unfortunately, they still haven't updated the documentation appropriately.
I recommend reading "You're doing it wrong" on Qt Blog and then use the answer by "Kari" as an example of how to set up a basic multi-threaded system.
But when I run this program it freezes my GUI and when I click inside my window,
it responds that your program is not responding.
Yes, because IMO you're doing so much work in the thread that it exhausts the CPU. Generally, the "program is not responding" message pops up when a process shows no progress in handling its application event queue. That is what happens in your case.
So in this case you should find a way to divide the work. Just for the sake of example, have the thread run in chunks of 100 and repeat until it completes all 10000000 iterations.
Also, you should have a look at QCoreApplication::processEvents() when you're performing a lengthy operation.

node.js C addon queueing by uv_queue_work

I have created a C node.js addon with the help of libUV to make the addon asynchronous.
I have made several queues for this.
The code is like this, loopArray is used for storing those queues:
//... variables declarations
void AsyncWork(uv_work_t* req) {
    // ...
}
void AsyncAfter(uv_work_t* req) {
    // ...
}
Handle<Value> RunCallback(const Arguments& args) {
    // ... some preparation work
    int loopNumber = (rand() % 10);
    int status = uv_queue_work(loopArray[loopNumber], &baton->request, AsyncWork, AsyncAfter);
    uv_run(loopArray[loopNumber]);
    return Undefined();
}
extern "C" {
    static void Init(Handle<Object> target) {
        int i = 0;
        for (i = 0; i < 10; i++) {
            loopArray[i] = uv_loop_new();
        }
        target->Set(String::NewSymbol("callback"), FunctionTemplate::New(RunCallback)->GetFunction());
    }
}
NODE_MODULE(addon, Init)
The problem is that, even though I created 10 queues for the CPU-demanding tasks, node.js does not switch between tasks while processing one of the queues. Is it due to the single-threaded nature of node.js?
If so, does uv_thread_create help the situation?
I cannot find any code sample for this, so I am not sure how to use it.
Thanks!
That is the main idea behind node's architecture: Using function call(back)s and a main event loop to run them instead of using threads to process multiple jobs in parallel.
If what you want to do is to process a queue of jobs, the best way to do it is doing one job at a time. Utilizing multiple CPU cores on a system is done by multiple node instances instead of threads. We have the child_process and cluster node modules for this.
When you create multiple threads, let's say you want to run 10 threads for your work, if your system has 8 CPU cores, you are hurting performance by giving unnecessary work to the operating system's scheduler. This is a very important point you should take into account. If you have 8 cores, you should not run more than 8 threads in parallel if you want maximum performance.
For node, we don't try to create multiple queues or threads in one process. Instead, we employ multiple node processes, again maximum one process per core.
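The "no more workers than cores" rule above can be sketched in C++ as a small helper. ChooseWorkerCount is a hypothetical name; std::thread::hardware_concurrency() is a standard call that may return 0 when the core count cannot be determined, so the sketch falls back to a single worker in that case.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Picks how many workers to start: never more than there are jobs, and never
// more than the number of hardware threads (at least 1 when that is unknown).
inline unsigned ChooseWorkerCount(
        unsigned jobCount,
        unsigned hardwareThreads = std::thread::hardware_concurrency()) {
    if (jobCount == 0) return 0;
    const unsigned cores = hardwareThreads == 0 ? 1u : hardwareThreads;
    const unsigned workers = std::min(jobCount, cores);
    assert(workers >= 1);
    return workers;
}
```

On an 8-core machine this gives 8 workers for 10 jobs and 3 workers for 3 jobs; the same number could equally be used to decide how many node processes to spawn.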
If you are going to process a queue that already exists, you do not need your C module to be asynchronous.
We want asynchronous behavior when we have jobs coming from outside like http requests on a web server. On a web server, our job comes in a way that we cannot control. People and other machines connect to our server whenever they want and we want to answer each of them as quickly as possible. For this, we do not want any request to block others. We need to handle as many requests as we can in parallel.
If you are running on rows of a database table or making some calculations over a long list of parameters however, you are in a very different kind of business. You have your job queue in front of you waiting for your way of management. Your jobs are not coming to your system in a way you have no control over. In this kind of business, to reach the ultimate efficiency and hit the topmost profits, you should run jobs one after another without any switching between them. Parallelism is only good when you have multiple cores and to employ them, the best practice for node is to use multiple node processes.

Resources