How to compile OpenCL programs on multiple cores?

OpenCL programs/kernels get built/compiled at runtime using the clBuildProgram() function. My program dynamically creates kernels to build, and as such spends a considerable amount of time compiling them. Since there are many kernels and they are completely independent of each other, I would like to split this work over multiple cores, as shown in the snippet below.
This person seems to have a very similar problem, but that was six years ago and the solution is not really satisfactory, in my opinion.
ThreadPool tempPool;
auto start = std::chrono::steady_clock::now();
for (int reps = 0; reps < 50; reps++) {
    tempPool.addJob([this] () {
        auto start = std::chrono::steady_clock::now();
        // These would hold the program sources
        std::vector<const char*> sources = {sourceCode.toRawUTF8()};
        std::vector<size_t> sourceLengths = {sourceCode.getNumBytesAsUTF8()};
        cl_int ret;
        cl_program program = clCreateProgramWithSource(getCLContext()(), 1, sources.data(), sourceLengths.data(), &ret);
        // Build the program
        ret = clBuildProgram(program, 1, &getCLDevices()[0](), NULL, NULL, NULL);
        if (ret != CL_SUCCESS) {
            // Generic error checking
        }
        auto singleDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
    });
}
// Simple way to wait for all jobs to be finished
while (tempPool.getNumJobs() > 0) {
    Thread::sleep(1);
}
auto totalDuration = std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start).count();
Everything else I run through this ThreadPool setup gets a speedup of 5-6x (I have 8 hardware threads), which is to be expected. Building OpenCL kernels, however, does not: it seems as if only one kernel can be building at any given time.
Is there a solution to this? I'm on macOS at the moment, but I would also be interested in Linux/Windows.
If not, is there a way to build OpenCL kernels that does not involve clBuildProgram(), using for example gcc or a similar tool?

(I am surprised that the driver for your platform isn't already multithreaded. Are you sure your calls are really running in parallel?)
If you're still stuck, here is a wretched hack that might work; it extends the solution in the question you referenced. For some drivers, clCreateProgramWithBinary is much faster than building from source. Hence:
fork new processes (or invoke a helper executable that targets the same device set)
each subprocess calls clCreateProgramWithSource and then clBuildProgram
the children call clGetProgramInfo(...CL_PROGRAM_BINARIES...) to fetch the binary, then pass it back via a file, a pipe, or some other interprocess communication
Again, I'd double-check your setup code first before duct-taping this hack together.
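A minimal sketch of that hack, assuming a POSIX system (buildInSubprocess is a hypothetical helper name, not part of any API). In practice you would probably exec a separate helper binary rather than fork(), since OpenCL driver state does not reliably survive a fork(); fork is used here only to keep the sketch self-contained:

#include <CL/cl.h>   // <OpenCL/opencl.h> on macOS
#include <sys/wait.h>
#include <unistd.h>
#include <cstring>
#include <vector>

std::vector<unsigned char> buildInSubprocess(cl_device_id dev, const char* src)
{
    int fd[2];
    pipe(fd);
    if (fork() == 0) {
        // Child: fresh context, build, then ship the device binary up the pipe.
        close(fd[0]);
        cl_int err;
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        size_t len = strlen(src);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, &len, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        size_t binSize = 0;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
        std::vector<unsigned char> bin(binSize);
        unsigned char* p = bin.data();
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(p), &p, NULL);
        write(fd[1], bin.data(), binSize);
        close(fd[1]);
        _exit(0);
    }
    // Parent: collect the binary, then hand it to clCreateProgramWithBinary() later.
    close(fd[1]);
    std::vector<unsigned char> bin;
    unsigned char chunk[4096];
    ssize_t n;
    while ((n = read(fd[0], chunk, sizeof(chunk))) > 0)
        bin.insert(bin.end(), chunk, chunk + n);
    close(fd[0]);
    wait(NULL);
    return bin;
}

Caching the returned bytes on disk, keyed by a hash of the source, would also let later runs skip compilation entirely.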

Related

How to do idempotent microbenchmarks or measure/emulate the CPU cycles used by a program in isolation?

My goal is the following:
Given a program with fixed input and output, run a microbenchmark whose result is an idempotent unit relative to the CPU work performed to compute the output. In other words, if you run the program multiple times with the same input, the benchmark should always produce the same value.
For instance, let's say I have this code:
// Brute force: O(n^2) | O(1)
function twoSum(nums, target) {
    for (let i = 0; i < nums.length - 1; i++) { // O(n^2)
        for (let j = i + 1; j < nums.length; j++) { // O(n)
            if (nums[i] + nums[j] === target) {
                return [i, j];
            }
        }
    }
    return [];
}
twoSum(Array(1e7).fill(2), 4);
I can easily run time node benchmarks/two-sum-implementations/runner.js and get the time taken, but if I run it multiple times I get different times depending on what the OS was doing. Most frameworks will run it multiple times and then average the times, but I don't want that.
Some ideas come to mind, but I am not sure how to implement them, or whether they would work at all. Maybe more experienced minds can shed some light here :)
Can I use Docker to run a program, track how much CPU time was used by the container that runs my program, and exit? Would that be a consistent metric?
Is there a program or tool that emulates CPU cycles, so I can know how much a program uses in isolation?
How do cloud providers like GCP and AWS bill CPU by time? What tools do they use to measure that?
Can you convert a program into its equivalent ASM (assembler) code and count the number of lines executed by the program? Something similar to what code coverage frameworks do with high-level code: they can count how many times a line was executed during a test.
Based on the previous question: how deep can code coverage tools go? If they can go deep enough and the counts are consistent, I can microbenchmark based on lines of code executed.
Any other ideas are welcome too!
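One direction worth exploring (my own sketch, not from the post above): count retired instructions instead of wall time. On Linux, perf stat -e instructions node runner.js reports them, and they are far more repeatable than timings; the same counter can also be read programmatically via the perf_event_open syscall. A minimal sketch, assuming Linux and permission to use performance counters:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <cstdio>

// Open a counter for retired user-space instructions on the calling process.
static int openInstructionCounter()
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1; // kernel work varies run to run; exclude it
    attr.exclude_hv = 1;
    return (int) syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main()
{
    int fd = openInstructionCounter();
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile long sum = 0; // stand-in for the code under test
    for (long i = 0; i < 1000000; i++) sum += i;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t instructions = 0;
    read(fd, &instructions, sizeof(instructions));
    printf("instructions retired: %llu\n", (unsigned long long) instructions);
    close(fd);
    return 0;
}

Instruction counts are still not perfectly idempotent (interrupts and address-space layout perturb them slightly), but the run-to-run variance is orders of magnitude smaller than wall-clock time.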

Multithreading in DirectX 12

I am having a hard time trying to grasp the concept of multithreaded rendering in DX12.
According to MSDN, one must write draw commands into direct command lists (preferably using bundles) and then submit those lists to a command queue.
It is also said that one can have more than one command queue for direct command lists, but it is unclear to me what the purpose of doing so is.
I take full profit of multithreading by building command lists in parallel threads, don't I? If so, why would I want more than one command queue associated with the device?
I suspect that improper management of command queues can lead to enormous performance trouble in later stages of rendering library development.
The main benefit of DirectX 12 is that execution of commands is almost purely asynchronous: when you call ID3D12CommandQueue::ExecuteCommandLists, it kicks off work for the commands passed in. This brings up another point, however. A common misconception is that rendering itself is somehow multithreaded now, and that is simply not true: all rendering work is still executed on the GPU. What is done on several threads is command list recording, as you create an ID3D12GraphicsCommandList object for each thread that needs one.
An example:
DrawObject DrawObjects[10];
ID3D12CommandQueue* GCommandQueue = ...;
// Declared at file scope so ExecuteCommands() can see both lists.
ID3D12GraphicsCommandList* clForThread1 = ...;
ID3D12GraphicsCommandList* clForThread2 = ...;

void RenderThread1()
{
    for (int i = 0; i < 5; i++)
        clForThread1->RecordDraw(DrawObjects[i]);
}

void RenderThread2()
{
    for (int i = 5; i < 10; i++)
        clForThread2->RecordDraw(DrawObjects[i]);
}

void ExecuteCommands()
{
    ID3D12CommandList* cl[2] = { clForThread1, clForThread2 };
    GCommandQueue->ExecuteCommandLists(2, cl);
    GCommandQueue->Signal(...);
}
This example is a very rough use case, but that is the general idea: you record the objects of your scene on different threads to remove the CPU overhead of recording the commands.
Another useful property of this setup is that you can kick off one rendering task and immediately start recording another.
An example:
void Render()
{
    ID3D12GraphicsCommandList* cl = ...;
    cl->DrawObjectsInTheScene(...);
    CommandQueue->Execute(cl); // Just send it to the GPU to start rendering all the objects in the scene
    // And since we have started the GPU work on rendering the scene, we can
    // record our post-processing while the scene is being rendered on the GPU
    ID3D12GraphicsCommandList* cl2 = ...;
    cl2->SetBloomPipelineState(...);
    cl2->SetResources(...);
    cl2->DrawOnScreenQuad();
}
The advantage here over DirectX 11 or OpenGL is that those APIs may just sit there recording and recording, and possibly not send their commands to the GPU until Present() is called, which forces the CPU to wait and incurs overhead.
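For completeness, a hedged sketch of the glue around the first example: spawn the two recording threads, submit both lists, then use a fence so the CPU knows when the GPU has finished. SubmitAndWait is an illustrative name, not a D3D12 API, and the command lists are assumed to be created and Close()d elsewhere:

#include <d3d12.h>
#include <windows.h>
#include <thread>

void RenderThread1(); // the recording functions from the example above
void RenderThread2();

void SubmitAndWait(ID3D12Device* device, ID3D12CommandQueue* queue,
                   ID3D12CommandList* const* lists, UINT count)
{
    // Record on two threads in parallel, then join before submitting.
    std::thread t1(RenderThread1);
    std::thread t2(RenderThread2);
    t1.join();
    t2.join();

    queue->ExecuteCommandLists(count, lists);

    // Fence-based CPU/GPU synchronization: block until the GPU passes the signal.
    ID3D12Fence* fence = nullptr;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));
    HANDLE done = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    queue->Signal(fence, 1);
    fence->SetEventOnCompletion(1, done);
    WaitForSingleObject(done, INFINITE);
    CloseHandle(done);
    fence->Release();
}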

Threadpool queueuserworkitem with many threads

Heads up: I am not very familiar with thread pools, which might be obvious from the following code. I am under the impression that I can push many values into this queue, that it will wait for one thread to complete before moving on to the next, and that the system will handle how many threads run at once.
I am trying to use ThreadPool::QueueUserWorkItem(waitcallback, num), where num is iterated up to a dynamic value determined by some prior algorithm. The problem is that the program crashes when numBlocks gets too high.
WaitCallback^ wcb = gcnew WaitCallback(this, &createImage);
for (int i = 0; i < numBlocks; i++)
{
    ThreadPool::QueueUserWorkItem(wcb, i);
}
I get the message: "Runtime Error! This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information."
My most recent run had numBlocks = 644.
It's hard to say what caused the program to crash. Most likely, an exception was thrown in one of the threads, and that brought the program down. You'll have to determine where in your code the exception was thrown.
As you know, ThreadPool::QueueUserWorkItem queues an item to be processed by the threadpool. But there can be multiple threads processing items from that queue. For example, you could have 20 pool threads, with 15 of them processing the work items that you queued.
If you really have that many items to process and you want them done one at a time, why not queue a single work item that does them all in order? I've never done managed C++, so I won't try to write an example with it. But perhaps you can translate this C# code:
void ProcessInBackground(object state)
{
    int numBlocks = (int)state;
    for (int i = 0; i < numBlocks; ++i)
    {
        createImage(i);
    }
}
And then you can call it with:
ThreadPool.QueueUserWorkItem(ProcessInBackground, numBlocks);
That queues a single work item that will process the blocks in order.
I suspect you can convert that to managed C++ fairly easily.
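A rough, untested C++/CLI translation of that idea might look like this (ImageMaker is a hypothetical ref class standing in for the asker's class):

// Member of the same ref class that owns createImage(int):
void ImageMaker::ProcessInBackground(Object^ state)
{
    int numBlocks = safe_cast<int>(state);
    for (int i = 0; i < numBlocks; ++i)
        createImage(i);
}

// Somewhere inside another member function: queue it once, and the pool
// processes all blocks sequentially on a single worker thread.
ThreadPool::QueueUserWorkItem(
    gcnew WaitCallback(this, &ImageMaker::ProcessInBackground), numBlocks);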

node.js C addon queueing by uv_queue_work

I have created a C node.js addon with the help of libuv to make the addon asynchronous.
I have made several queues for this.
The code is like this; loopArray is used for storing those queues:
// ... variable declarations
void AsyncWork(uv_work_t* req) {
    // ...
}
void AsyncAfter(uv_work_t* req) {
    // ...
}
Handle<Value> RunCallback(const Arguments& args) {
    // ... some preparation work
    int loopNumber = (rand() % 10);
    int status = uv_queue_work(loopArray[loopNumber], &baton->request, AsyncWork, AsyncAfter);
    uv_run(loopArray[loopNumber]);
    return Undefined();
}
extern "C" {
static void Init(Handle<Object> target) {
int i = 0;
for (i = 0; i< 10; i++){
loopArray[i] = uv_loop_new();
}
target->Set(String::NewSymbol("callback"), FunctionTemplate::New(RunCallback)->GetFunction());
}
}
NODE_MODULE(addon, Init)
The problem is that even though I created 10 queues for the CPU-demanding tasks, node.js does not switch between tasks while processing one of the queues. Is this due to the single-threaded nature of node.js?
If so, would uv_thread_create help the situation?
I cannot find any code sample for this, so I am not sure how to use it.
Thanks!
That is the main idea behind node's architecture: using function call(back)s and a main event loop to run them, instead of using threads to process multiple jobs in parallel.
If what you want is to process a queue of jobs, the best way to do it is one job at a time. Utilizing multiple CPU cores on a system is done with multiple node instances instead of threads; we have the child_process and cluster node modules for this.
When you create multiple threads, say 10 threads for your work on a system with 8 CPU cores, you are hurting performance by giving unnecessary work to the operating system's scheduler. This is an important point: if you have 8 cores, you should not run more than 8 threads in parallel if you want maximum performance.
For node, we don't try to create multiple queues or threads in one process. Instead, we employ multiple node processes, again at most one process per core.
If you are processing a queue that is already there, you do not need your C module to be asynchronous at all.
We want asynchronous behavior when jobs come from outside, like HTTP requests on a web server. On a web server, work arrives in a way we cannot control: people and other machines connect to our server whenever they want, and we want to answer each of them as quickly as possible. For this, we do not want any request to block others; we need to handle as many requests as we can in parallel.
If you are running over the rows of a database table or doing calculations over a long list of parameters, however, you are in a very different kind of business. You have your job queue in front of you, waiting for you to manage it; the jobs are not arriving outside of your control. In this kind of business, the most efficient approach is to run the jobs one after another without any switching between them. Parallelism is only good when you have multiple cores, and for node the best practice for employing them is to use multiple node processes.
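As a side note on the original code: uv_queue_work against the default loop already runs the work callback on libuv's worker thread pool, so ten hand-made loops are not needed for that. A minimal sketch using the modern libuv API (the question's code predates it; the after-callback has since gained an int status parameter, and the UV_THREADPOOL_SIZE environment variable controls the number of workers):

#include <uv.h>
#include <cstdio>

struct Baton { uv_work_t req; int input; int result; };

void AsyncWork(uv_work_t* req) {
    Baton* b = static_cast<Baton*>(req->data);
    b->result = b->input * 2; // CPU-bound work, runs on a pool thread
}

void AsyncAfter(uv_work_t* req, int status) {
    Baton* b = static_cast<Baton*>(req->data);
    printf("result: %d\n", b->result); // back on the loop thread; safe for V8 here
    delete b;
}

int main() {
    for (int i = 0; i < 10; i++) {
        Baton* b = new Baton{ {}, i, 0 };
        b->req.data = b;
        uv_queue_work(uv_default_loop(), &b->req, AsyncWork, AsyncAfter);
    }
    uv_run(uv_default_loop(), UV_RUN_DEFAULT);
    return 0;
}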

Can I easily write a program to make use of Intel's quad-core or i7 chip if only 1 thread is used?

I wonder: if my program has only 1 thread, can I write it so that a quad-core or i7 CPU can actually make use of the different cores? Usually when I write programs on a quad-core computer, the CPU usage only goes to about 25%, and the work seems to be divided among the 4 cores, as the Task Manager shows. (The programs I write are usually in Ruby, Python, or PHP, so they may not be very optimized.)
Update: what if I write it in C or C++ instead, like this:
for (i = 0; i < 100000000; i++) {
    a = i * 2;
    b = i + 1;
    if (a == ... || b == ...) { ... }
}
and then use the highest level of optimization in the compiler? Can the compiler make the multiplication happen on one core and the addition happen on a different core, so that two cores work at the same time? Isn't that a fairly easy optimization to use two cores?
No. You need to use threads to execute multiple paths concurrently on multiple CPU's (be they real or virtual)... execution of one thread is inherently bound to one CPU as this maintains the "happens before" relationship between statements, which is central to how programs work.
First, unless multiple threads are created in the program, there is only a single thread of execution in that program.
Seeing 25% of CPU resources being used for the program is an indication that a single core out of four is being utilized at 100%, while the other cores are idle. If all cores were used, it would be theoretically possible for the process to consume 100% of the CPU resources.
As a side note, the graphs shown in Task Manager in Windows show the CPU utilization of all processes running at the time, not just of one process.
Secondly, the code you present could be split across two separate threads in order to execute on two cores. I am guessing that you want to show that a and b are independent of each other and only depend on i. In that situation, separating the inside of the for loop as follows could allow multi-threaded operation and could lead to increased performance:
// Process this in one thread:
for (int i = 0; i < 1000; i++) {
    a = i * 2;
}
// Process this in another thread:
for (int i = 0; i < 1000; i++) {
    b = i + 1;
}
However, it becomes tricky if the results from the two separate threads need to be evaluated together, as the if statement later on seems to imply:
for (i = 0; i < 1000; i++) {
    // manipulate "a" and "b"
    if (a == ... || b == ...) { ... }
}
This would require the a and b values, which reside in separate threads (executing on separate processors), to be looked up, which is a serious headache.
There is no good guarantee that the i values of the two threads are the same at the same time (after all, multiplication and addition will probably take different amounts of time to execute), which means one thread may need to wait for the other for the i values to get in sync before comparing the a and b that correspond to that value of i. Or do we make a third thread for value comparison and synchronization of the two threads? In either case the complexity builds up very quickly, so I think we can agree that a serious mess is arising: sharing state between threads can be very tricky.
Therefore, the code example you provide is only partially parallelizable without much effort; as soon as there is a need to compare the two variables, separating the two operations becomes very difficult very quickly.
A couple of rules of thumb for concurrent programming:
When a task can be broken into parts that process data completely independent of other data and its results (states), parallelizing can be very easy.
For example, two functions which calculate a value from an input (in pseudocode):
f(x) = { return 2x }
g(x) = { return x+1 }
These two functions don't rely on each other, so they can be executed in parallel without any pain. Also, as there are no states to share or handle between calculations, even if there were multiple values of x to calculate, those can be split up further:
x = [1, 2, 3, 4]
foreach t in x:
    runInThread(f(t))
foreach t in x:
    runInThread(g(t))
Now, in this example, we can have 8 separate threads performing calculations. Not having side effects can be a very good thing for concurrent programming.
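As an illustration (mine, not the original answer's), here is a concrete C++ version of that pseudocode, where every thread writes to its own result slot so there is no shared state to protect:

#include <thread>
#include <vector>
#include <cstdio>

int f(int x) { return 2 * x; }
int g(int x) { return x + 1; }

int main()
{
    const int xs[] = {1, 2, 3, 4};
    int fr[4], gr[4]; // one result slot per thread; nothing is shared
    std::vector<std::thread> threads;

    for (int i = 0; i < 4; i++) {
        threads.emplace_back([&, i] { fr[i] = f(xs[i]); });
        threads.emplace_back([&, i] { gr[i] = g(xs[i]); });
    }
    for (auto& t : threads) t.join(); // 8 threads, all independent

    for (int i = 0; i < 4; i++)
        printf("f(%d)=%d g(%d)=%d\n", xs[i], fr[i], xs[i], gr[i]);
    return 0;
}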
However, as soon as there is a dependency on data and results from other calculations (which also means there are side effects), parallelization becomes extremely difficult. In many cases these types of problems have to be performed serially as they await results from other calculations.
Perhaps the question comes down to: why can't compilers figure out the parts that can be automatically parallelized and perform those optimizations? I'm not an expert on compilers, so I can't say, but there is an article on automatic parallelization on Wikipedia which may have some information.
I know Intel chips very well.
Per your code, "if (a == ... || b == ...)" is a barrier; otherwise the processor cores will execute all of that code in parallel regardless of what kind of optimization the compiler has done, provided the compiler is not a very "stupid" one. It means the hardware itself has this capability, not the software, so threaded programming or OpenMP is not necessary in such cases, though they will help improve the parallel computing. Note this does not mean Hyper-Threading, just normal multi-core processor functionality.
Please google "processor pipeline multi port parallel" to learn more.
Here I'd like to give a classical example that could be executed by multi-core/multi-channel-IMC platforms (e.g. the Intel Nehalem family, such as Core i7) in parallel; no extra software optimization would be needed.
char buffer0[64];
char buffer1[64];
char buffer2[64];
char buffer[192];
int i;
for (i = 0; i < 64; i++) {
    *(buffer + i) = *(buffer0 + i);
    *(buffer + 64 + i) = *(buffer1 + i);
    *(buffer + 128 + i) = *(buffer2 + i);
}
Why? 3 reasons.
1. Core i7 has a triple-channel IMC whose bus width is 192 bits, 64 bits per channel, and the memory address space is interleaved among the channels on a per-cache-line basis (a cache line is 64 bytes). So basically buffer0 is on channel 0, buffer1 on channel 1 and buffer2 on channel 2, while buffer[192] is interleaved evenly among the 3 channels, 64 bytes per channel. The IMC supports loading or storing data from or to multiple channels concurrently; that is a multi-channel MC burst with maximum throughput. In the description below I'll just say 64 bytes per channel, i.e. BL x8 (Burst Length 8, 8 x 8 = 64 bytes = one cache line) per channel.
2. buffer0..2 and buffer are contiguous in memory space (on a specific page, both virtually and physically; stack memory). When run, buffer0, 1, 2 and buffer are loaded/fetched into the processor cache, 6 cache lines in total, so after the execution of the "for(){}" code above starts, accessing memory is not necessary at all because all the data are in the L3 cache, an uncore part shared by all cores (we'll not talk about L1/L2 here). In this case every core could pick the data up and compute them independently; the only requirement is that the OS supports MP and task stealing, i.e. runtime scheduling and affinity sharing.
3. There are no dependencies among buffer0, 1, 2 and buffer, so there are no execution stalls or barriers; e.g. executing *(buffer + 64 + i) = *(buffer1 + i) does not need to wait for *(buffer + i) = *(buffer0 + i) to finish.
The most important and difficult point, though, is "task stealing, runtime scheduling and affinity sharing": for a given task there is only one task execution context, and it would have to be shared by all cores to perform parallel execution. Anyone who understands this point is among the top experts in the world; I'm looking for such an expert to co-work on my open-source project and be responsible for parallel computing and latest-HPC-architecture-related work.
Note that in the example code above you could also use some SIMD instructions, such as movntdq/movntdqa, which bypass the processor cache and write to memory directly. This is also a very good idea when performing software-level optimization, since accessing memory is extremely expensive: accessing the cache (L1) may need just 1 cycle, but accessing memory needs 142 cycles on former x86 chips.
Please visit http://effocore.googlecode.com and http://effogpled.googlecode.com for details.
Implicit parallelism is probably what you are looking for.
If your application code is single-threaded, multiple processors/cores will only be used if:
the libraries you use are using multiple threads (perhaps hiding this usage behind a simple interface)
your application spawns other processes to perform some part of its operation
Ruby, Python and PHP applications can all be written to use multiple threads, however.
A single-threaded program will only use one core. The operating system might well decide to shift the program between cores from time to time, according to some rules, to balance the load, etc. So you will see only 25% usage overall, with all four cores working, but only one at a time.
The only way to use multiple cores without multithreading is to use multiple programs.
In your example above, one program could handle 0-24999999, the next 25000000-49999999, and so on. Set all four off at the same time, and they will use all four cores.
Usually you would be better off writing a (single) multithreaded program.
With C/C++ you can use OpenMP. It's C code with pragmas like
#pragma omp parallel for
for (..) {
    ...
}
to say that this for loop will run in parallel.
This is one easy way to parallelize something, but at some point you will have to understand how parallel programs execute, and you will be exposed to parallel-programming bugs.
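To make that concrete, here is a runnable sketch of the questioner's loop under OpenMP (compile with gcc -fopenmp; the condition and the counter are stand-ins for the elided if (a == ... || b == ...)):

#include <stdio.h>

int main(void)
{
    long match_count = 0;

    // Each thread gets a chunk of the iteration space; the reduction clause
    // merges the per-thread counters at the end without a data race.
    #pragma omp parallel for reduction(+:match_count)
    for (long i = 0; i < 100000000; i++) {
        long a = i * 2;
        long b = i + 1;
        if (a % 1000 == 0 || b % 1000 == 0) // stand-in condition
            match_count++;
    }

    printf("matches: %ld\n", match_count);
    return 0;
}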
If you want to parallelize the choice of the i values for which your statement if (a == ... || b == ...) evaluates to true, you can do this with PLINQ (in .NET 4.0):
// Note the "AsParallel"; that's it, multicore support.
var query = from i in Enumerable.Range(0, 100000000).AsParallel()
            where (i % 2 == 1 && i >= 10) // your condition
            select i;
// While iterating, the query is evaluated in parallel!
// Results will probably never be in order (e.g. 13, 11, 17, 15, 19...)
foreach (var selected in query)
{
    // Not parallel here!
}
If, instead, you want to parallelize operations, you can do:
Parallel.For(0, 100000000, i =>
{
    if (i > 10) // your condition here
        DoWork(i); // thread-safe operation
});
Since you are talking about 'Task Manager', you appear to be running on Windows. However, if you are running a web server there (for Ruby or PHP with fcgi, or Apache pre-forking, and to a lesser extent other Apache workers) with multiple processes, then they would tend to spread out across the cores.
If only a single program without threading is running, then no, no significant advantage will come from that; you're only running one thing at a time, apart from OS-driven background processes.
