Utilize GPU in Dart / Flutter other than graphics - multithreading

Aye Aye good people,
I was wondering if I can 'use' ('call'? 'thread'?)
the GPU with Dart and Flutter.
The documentation states:
The GPU thread executes graphics code from the Flutter Engine. This
thread takes the layer tree and displays it by talking to the GPU
(graphic processing unit). You cannot directly access the GPU thread
or its data, but if this thread is slow, it’s a result of something
you’ve done in the Dart code. Skia, the graphics library, runs on this
thread, which is sometimes called the rasterizer thread.[...] More
information on profiling the GPU thread can be found at flutter.dev.
(which doesn't add much)
But what if I don't want to use it for graphics?
Let's say, for example, that I want to use the Monte Carlo method
for some calculation;
could I make a call or send a thread to the GPU?
Thank you for your attention.

"GPU thread" was a confusing name, so we renamed it to "raster thread". This thread is actually running on a CPU core, and its function is to rasterize graphics to be sent to the GPU. Many people assumed the thread is running on the GPU itself, but that is not the case. Thus, the rename.
(We renamed it pretty recently. Your original question used the correct terminology at the time.)
Unfortunately, you cannot compile Dart code to run on the GPU (à la CUDA) the way you can with C++, for example.
An option is to write your Monte Carlo routine in something like C++, then use Dart's FFI to call that routine from Dart code. This will run synchronously and as fast as you can make the C++ code go.
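For illustration, here's a minimal sketch of what the native side could look like, assuming a Monte Carlo pi estimate as the workload (the function name monte_carlo_pi and its signature are made up). Compile it into a shared library and bind it from Dart with dart:ffi (DynamicLibrary.open plus lookupFunction):

#include <cstdint>
#include <random>

// Exported with C linkage so Dart's FFI can look the symbol up by name.
extern "C" double monte_carlo_pi(int64_t samples) {
  std::mt19937_64 rng{std::random_device{}()};
  std::uniform_real_distribution<double> dist(0.0, 1.0);
  int64_t inside = 0;
  for (int64_t i = 0; i < samples; ++i) {
    const double x = dist(rng);
    const double y = dist(rng);
    if (x * x + y * y <= 1.0) ++inside;  // point landed inside the quarter circle
  }
  return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}

As the answer says, the FFI call runs synchronously, so for long runs you would invoke it from a separate isolate to avoid blocking the UI thread.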

Related

How should I make a multithreaded program that uses the GPU for computation?

I am making a simulation program that uses compute shaders, and I ran into a problem. I am currently using an OpenGL context to render GUI stuff to control and watch the simulation, and I use the same context to call glDispatchCompute.
That can cause the program window to freeze, because the simulation might run at any UPS (anywhere from 0.1 to 10,000 updates per second) while the window should update at a fixed FPS (the display refresh rate, commonly 60 FPS).
This becomes a problem when the simulation is slow and a single step takes, for example, 600 ms to compute: the swap-buffers call waits for all compute shaders to finish, and so the FPS drops.
How can I make updates and renders independent from each other? On the CPU I could just spawn a second thread, but an OpenGL context is not multithreaded. Should I use Vulkan for this task?
Even with Vulkan, there is no way to just shove a giant blob of work at the GPU and guarantee that later graphics work will just interrupt the GPU's processing. The most reliable way to handle this is to break your compute work up into chunks of a size that you're reasonably sure will not break your framerate and interleave them with your rendering commands.
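A minimal sketch of that interleaving, assuming the simulation step can be parameterized by a chunk offset (every name here — running, total_chunks, groups_per_chunk, chunk_offset_location, renderGui, swapBuffers — is hypothetical):

const int kChunksPerFrame = 4;  // tune so one frame's compute fits the frame budget
int next_chunk = 0;
while (running) {
    // Dispatch only a bounded amount of simulation work per frame.
    for (int i = 0; i < kChunksPerFrame && next_chunk < total_chunks; ++i, ++next_chunk) {
        glUniform1i(chunk_offset_location, next_chunk);  // tell the shader which slice to process
        glDispatchCompute(groups_per_chunk, 1, 1);
    }
    if (next_chunk == total_chunks) next_chunk = 0;      // begin the next simulation step
    glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);      // make compute writes visible to rendering
    renderGui();
    swapBuffers();  // now waits only on this frame's bounded compute work
}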
Vulkan offers mechanisms that allow GPUs to execute interruptible work, but it does not require implementations to provide any particular preemption functionality. You can create a compute queue that has the lowest possible priority and a graphics queue with the highest priority (see the sketch after this list). But even that assumes:
That the Vulkan implementation offers multiple queues at all. Many embedded implementations do not.
That queue priority implementations will preempt work in progress when higher-priority work is submitted. This may happen, but the specification offers no guarantees. There is some documentation on the preemption behavior of various GPUs, showing that some of them can handle this. It's a few years old, so more recent GPUs may be even better, but it should get you started.
Overall, Vulkan can help, but you'll need to profile it and you'll need to have a fallback if you care about implementations that don't have enough queues to do anything.
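As a concrete illustration of the queue setup described above, here is a minimal sketch, assuming the chosen queue family supports both graphics and compute and exposes at least two queues (device and familyIndex are placeholders):

float priorities[2] = {1.0f, 0.0f};  // graphics queue high priority, compute queue low

VkDeviceQueueCreateInfo queueInfo{};
queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
queueInfo.queueFamilyIndex = familyIndex;
queueInfo.queueCount = 2;
queueInfo.pQueuePriorities = priorities;
// Pass queueInfo via VkDeviceCreateInfo::pQueueCreateInfos when calling vkCreateDevice,
// then retrieve the queues:
VkQueue graphicsQueue, computeQueue;
vkGetDeviceQueue(device, familyIndex, 0, &graphicsQueue);  // priority 1.0
vkGetDeviceQueue(device, familyIndex, 1, &computeQueue);   // priority 0.0

Whether the low-priority compute queue actually gets preempted is still up to the implementation, which is exactly the caveat above.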
OpenGL is of course even less useful for this, as it has no explicit queueing system at all. So breaking the work up is really the only way to ensure that the compute task doesn't starve the renderer.

Independent Thread Scheduling since Volta

Nvidia introduced Independent Thread Scheduling for their GPGPUs starting with Volta. When CUDA threads diverge, the alternative code paths are no longer executed block-wise but can be interleaved instruction by instruction. Still, divergent paths cannot execute at the same time, since the GPUs are still SIMT. This is the original article:
https://developer.nvidia.com/blog/inside-volta/ (scroll down to "Independent Thread Scheduling").
I understand what this means. What I don't understand is in what way this new behavior accelerates code. Even the before/after diagrams in the above article do not show an overall speed-up.
My question: Which kinds of divergent algorithms will run faster on Volta (and newer) due to the described new scheduling?
The purpose of the feature is not necessarily to accelerate code.
An important purpose of the feature is to enable reliable use of programming models such as producer-consumer within a warp (among threads in the same warp), which would have been either brittle or prone to hang with the pre-Volta thread schedulers.
The typical example, IMO, of which you can find various instances here on the cuda tag, is people trying to negotiate atomic locks among threads in the same warp. This would have been "brittle" (and here) or not workable (hangs) on previous architectures. It works well on Volta, in my experience.
Here is another example of an algorithm that just hangs pre-Volta, but "works" (does not hang) on Volta+.
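For illustration, a minimal sketch of the kind of intra-warp lock being described (the kernel and its layout are made up, not taken from the linked posts). Compiled with nvcc -arch=sm_70 or newer, it completes; with the pre-Volta schedulers this pattern can spin forever, because the warp may never let the lock holder advance to the release:

#include <cstdio>

__global__ void warp_lock_demo(int* lock, int* counter) {
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {  // try to acquire the lock
            *counter += 1;                 // critical section
            __threadfence();               // publish the write before releasing
            atomicExch(lock, 0);           // release the lock
            done = true;
        }
    }
}

int main() {
    int *lock, *counter;
    cudaMallocManaged(&lock, sizeof(int));
    cudaMallocManaged(&counter, sizeof(int));
    *lock = 0; *counter = 0;
    warp_lock_demo<<<1, 32>>>(lock, counter);  // all 32 threads are in one warp
    cudaDeviceSynchronize();
    printf("counter = %d\n", *counter);        // expect 32 on Volta+
    cudaFree(lock); cudaFree(counter);
    return 0;
}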

How to disable multithreading in PyTorch?

I am trying to ensure that a PyTorch program built in C++ uses only a single thread. The program runs on the CPU.
It has a fairly small model, and multithreading doesn't help; it actually causes problems, because my program is already multithreaded. I have called:
at::set_num_interop_threads(1);
at::set_num_threads(1);
torch::set_num_threads(1);
omp_set_num_threads(1);
omp_set_dynamic(0);
omp_set_nested(0);
In addition, I have set the environment variable
OPENBLAS_NUM_THREADS to 1.
Still, when I spawn a single thread, a total of 16 threads show up in htop, and 16 of the machine's processors go to 100%.
Am I missing something? What?
From the PyTorch docs, one can do:
torch.set_num_threads(1)
To be on the safe side, do this before you instantiate any models etc. (so immediately after the import). This worked for me.
More info: https://jdhao.github.io/2020/07/06/pytorch_set_num_threads/
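In the C++ (libtorch) case, the equivalent is a sketch like the following, assuming the calls are made before any model or tensor work, since the thread pools are created lazily and their sizes cannot be changed after first use. Setting OMP_NUM_THREADS=1 and MKL_NUM_THREADS=1 in the environment before the process starts covers the pools spun up by the underlying math libraries:

#include <torch/torch.h>
#include <ATen/Parallel.h>

int main() {
    // Must run before any parallel work; set_num_interop_threads in
    // particular must be called before any inter-op work has started.
    at::set_num_threads(1);          // intra-op parallelism
    at::set_num_interop_threads(1);  // inter-op parallelism

    torch::Tensor t = torch::rand({64, 64});
    torch::Tensor p = t.matmul(t);   // runs on a single thread
    return 0;
}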

Multi-thread code with single-core processor and single-thread code with multi-core processor

I'm new to multi-threaded programming. I have been reading some articles, but there are two main points I'm not completely sure about.
If I have single-threaded (sequential) code and I run it on a multi-core processor, will the OS try to divide the thread into multiple threads (while taking care of dependencies) to take advantage of the multi-core processor?
If I have multi-threaded code and I run it on a single-core processor, will the OS time-share between the different threads (the same way it does with multiple processes)?
1) No
If an application makes use of, for example, the Intel maths libraries and has been compiled with the right switches, routines like FFTs will at runtime be split out into separate threads matching the number of cores in the machine. Your source code remains 'single threaded', but the library is creating and destroying threads behind your back.
Similarly, some compilers (e.g. Intel's icc, Sun's C compiler) may turn some loops into separate threads, each tackling a share of the iterations. Again the source code looks single-threaded, but the compiler generates threaded code on your behalf. It's a bit like automatically applying some OpenMP to your source code.
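For illustration, a minimal sketch of what that compiler trick amounts to, written here as the equivalent explicit OpenMP pragma (the loop is a made-up example; compile with -fopenmp or equivalent):

// No iteration depends on another, so the iterations can be
// divided among threads without changing the result.
void scale(double* a, const double* b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        a[i] = 2.0 * b[i];
    }
}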
OSes cannot second guess what an application is going to do, so they cannot intervene like this. Libraries and compilers know what is about to happen, so they can.
Libraries and compiler tricks like this have been developed so as to make it easy for programmers to extract higher performance from 'single' threaded code. Intel started adding features like that to their maths library around about the same time they started heading towards multi-core CPUs. The idea was to create (from the programmer's point of view) the impression of better 'single' thread performance, whilst the speed was actually being delivered by multiple cores. Similarly with Sun when they started doing multi-processor computers.
And with everyone more or less giving up on making significant improvements to the performance of a single core, this is the only way ahead.
2) Yes. How else would it do it?
No, the operating system does not have enough information to do that. For parallelization you need to consider the dependencies between operations. Some compilers try to do that, since they have more information about the intent of the code, but even they often fail to do it effectively.
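For example (a made-up snippet), this loop carries a dependency from one iteration to the next, so no OS or compiler can blindly split it across threads without changing the result:

void prefix_sum(double* a, int n) {
    for (int i = 1; i < n; ++i) {
        a[i] += a[i - 1];  // reads the previous iteration's output
    }
}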
Yes, for example the Linux scheduler does not even distinguish between threads and processes.

Multithreading behaviour change when linking a static library to a program

I have been developing an efficient sparse matrix solver that uses multithreading (C++11 std::thread) for the past year. In standalone tests my code works perfectly, and all expectations were exceeded. However, when linking the code (as a static library) into the software I am developing for, the performance was much worse, and from what I can see of the CPU loads in Task Manager, all threads are running on the same core, which was not the case during the standalone testing.
Does system load have anything to do with this?
I don't have access to the software's code.
Does anyone have any advice or an explanation?
Have you considered the tradeoff between a context switch and the actual workload of each thread? This problem can happen when the context switches are more CPU-intensive than the actual work each thread performs. Try increasing the work each thread does and see if the problem gets resolved.
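A minimal sketch of that coarsening, assuming the solver's work can be partitioned into contiguous row ranges (process_rows and the partitioning are hypothetical): give each thread one large slice instead of many small tasks, so useful work dominates any scheduling overhead:

#include <algorithm>
#include <thread>
#include <vector>

// Stand-in for one slice of the sparse-matrix computation.
void process_rows(int begin, int end) { /* ... */ }

void run_coarse(int total_rows) {
    const unsigned n = std::max(1u, std::thread::hardware_concurrency());
    const int per_thread = (total_rows + static_cast<int>(n) - 1) / static_cast<int>(n);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t) {
        const int begin = static_cast<int>(t) * per_thread;
        const int end = std::min(total_rows, begin + per_thread);
        if (begin < end)
            pool.emplace_back(process_rows, begin, end);  // one big slice per thread
    }
    for (auto& th : pool) th.join();
}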
