OpenCL - how to effectively distribute work items to different devices - multithreading

I'm writing an OpenCL application where I have N work items that I want to distribute to D devices, where N > D, and in turn each device can process the elements of its own work item in parallel, thus achieving a sort of "double" parallelism.
Here is the code I have written already to try and achieve this.
First I create an event for each of my devices and set them all to complete:
cl_int err;
cl_event *events = new cl_event[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
    events[i] = clCreateUserEvent(context, &err);
    err = clSetUserEventStatus(events[i], CL_COMPLETE);
}
Each device also has its own command queue and its own "instance" of a kernel.
Then I enter my "main loop" for distributing work items. The code finds the first available device and enqueues the work item on it.
/*---Loop over all available jobs---*/
for(int i = 0; i < numWorkItems; i++)
{
    WorkItem item = workItems[i];
    bool found = false; //Check for device availability
    int index = -1;     //Index of found device
    while(!found)       //Continuously loop until a free device is found.
    {
        for(int j = 0; j < deviceCount; j++) //Total number of CPUs + GPUs
        {
            cl_int status;
            err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);
            if(status == CL_COMPLETE) /*Current device has completed all of its tasks*/
            {
                found = true; //Exit infinite loop
                index = j;    //Choose current device
                break;        //Break out of inner loop
            }
        }
    }

    //Enqueue my kernel
    clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
    clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc, 0, NULL, &events[index]);
    clFlush(queues[index]);
}
And then finally I wrap up by calling clFinish on all my devices:
/*---Wait For Completion---*/
for(int i = 0; i < deviceCount; i++)
{
    clFinish(queues[i]);
}
This approach has a few problems however:
1) It doesn't distribute the work to all my devices. On my current computer I have 3 devices. My algorithm above only distributes the work to devices 1 and 2. Device 3 always gets left out because devices 1 and 2 finish so quickly that they can snatch up more work items before 3 gets a chance.
2) Even with devices 1 and 2 running together, I only see a very, very mild speed increase. For instance, if I were to assign all work items to device 1 it might take 10 seconds to complete, and if I assign them all to device 2 it might take 11 seconds, but if I split the work between them it might take 8-9 seconds combined, when what I would hope for is something between 4-5 seconds. I get the feeling that they might not really be running in parallel with each other the way I want.
How do I fix these issues?

You have to be careful with the sizes and the memory location. Typically these factors are not considered when dealing with GPU devices. I would ask you:
What are the kernel sizes?
How fast do they finish?
If the kernels are small and finish quite quickly, then the overhead of launching them will be high, and the finer granularity of distributing them across many devices will not make up for that extra overhead. In that case it is better to increase the work size directly and use one device only.
Are the kernels independent? Do they use different buffers?
Another important thing is to have completely separate memory for each device; otherwise the memory thrashing between devices will delay the kernel launches, and in that case a single device (holding all the memory buffers locally) will perform better.
OpenCL copies all the buffers a kernel uses to its device, and it will "block" any kernel (even on another device) that uses a buffer the running kernel is writing to; it waits for that kernel to finish and then copies the buffer back to the other device.
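For illustration, a minimal sketch of the "one buffer per device" idea, assuming the same context, kernels[] and deviceCount as in the question (bufSize and perDeviceBuf are made-up names, and the fixed array size is just an assumption for the sketch):
cl_mem perDeviceBuf[16];                       /* assumes deviceCount <= 16 */
for (int d = 0; d < deviceCount; d++)
{
    /* each device gets its own cl_mem, so no two devices ever write to the
       same buffer and no hidden cross-device copies occur */
    perDeviceBuf[d] = clCreateBuffer(context, CL_MEM_READ_WRITE, bufSize, NULL, &err);
    clSetKernelArg(kernels[d], 0, sizeof(cl_mem), &perDeviceBuf[d]);
}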
Is the host a bottleneck?
The host is sometimes not as fast as you may think, and sometimes the kernels run so fast that the host is a big bottleneck scheduling jobs to them.
If you use the CPU as a CL device, then it cannot do both tasks (act as host and run kernels). You should always prefer GPU devices over CPU devices when scheduling kernels.
Never let a device go empty
Waiting until a device has finished executing before queuing more work is typically a very bad idea. You should queue kernels preemptively, 1 or 2 in advance, even before the current kernel has finished (a sketch of this appears at the end of this answer). Otherwise device utilization will not even reach 80%, since there is a long delay between the kernel finishing and the host noticing it, and an even longer delay until the host queues more work for the device (typically >2 ms each; for a 10 ms kernel that can easily mean a third of the time wasted).
I would do:
Change the availability check from completed to submitted jobs: if(status >= CL_SUBMITTED)
Ensure the devices are ordered GPU -> CPU. So, the GPUs are the device 0,1 and CPU is the device 2.
Try removing the CPU device (only using the GPUs). Maybe the speed is even better.
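As a minimal sketch (not the poster's code) of the "keep each device busy" idea: allow up to two outstanding kernels per device and give the next work item to the first device with a free slot, instead of waiting for CL_COMPLETE. Names such as MAX_IN_FLIGHT, inFlight and lastEvent are illustrative; kernels[], queues[], workItems[], glob and loc are from the question, and the schematic clSetKernelArg call mirrors the question's.
#define MAX_IN_FLIGHT 2                        /* kernels queued ahead per device */

int      inFlight[16] = {0};                   /* assumes deviceCount <= 16 */
cl_event lastEvent[16][MAX_IN_FLIGHT];

for (int i = 0; i < numWorkItems; i++)
{
    int index = -1;
    while (index < 0)                          /* wait for a device with a free slot */
    {
        for (int j = 0; j < deviceCount; j++)
        {
            /* retire events whose kernels have completed */
            for (int k = 0; k < inFlight[j]; )
            {
                cl_int status;
                clGetEventInfo(lastEvent[j][k], CL_EVENT_COMMAND_EXECUTION_STATUS,
                               sizeof(status), &status, NULL);
                if (status == CL_COMPLETE) {
                    clReleaseEvent(lastEvent[j][k]);
                    lastEvent[j][k] = lastEvent[j][--inFlight[j]];
                } else {
                    ++k;
                }
            }
            if (inFlight[j] < MAX_IN_FLIGHT) { index = j; break; }
        }
    }

    clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &workItems[i]);
    clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc,
                           0, NULL, &lastEvent[index][inFlight[index]]);
    inFlight[index]++;
    clFlush(queues[index]);
}
Whether one or two kernels in flight per device works best depends on how long each kernel runs relative to the host-side scheduling latency.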

Related

Linux UART slower than specified Baudrate

I'm trying to communicate between two Linux systems via UART.
I want to send large chunks of data. With the specified Baudrate it should take around 5 seconds, but it takes nearly 10 times the expected time.
As I'm sending more than the buffer can handle at once, it is sent in small parts and I'm draining the buffer in between. If I measure the time needed for the drain and the number of bytes written to the buffer, I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
I would expect the transmission to be slower than optimal, but not by this much.
Did I miss something while setting the UART or while writing? Or is this normal?
The code used for setup:
int bus = open(interface.c_str(), O_RDWR | O_NOCTTY | O_NDELAY); // <- also tried blocking
if (bus < 0) {
    return;
}
struct termios options;
memset(&options, 0, sizeof options);
if(tcgetattr(bus, &options) != 0){
    close(bus);
    bus = -1;
    return;
}
cfsetspeed(&options, B230400);
cfmakeraw(&options); // <- also tried this manually. did not make a difference
if(tcsetattr(bus, TCSANOW, &options) != 0)
{
    close(bus);
    bus = -1;
    return;
}
tcflush(bus, TCIFLUSH);
The code used to send:
int32_t res = write(bus, data, dataLength);
while (res < dataLength){
    tcdrain(bus); // <- taking way longer than expected
    int32_t r = write(bus, &data[res], dataLength - res);
    if(r == 0)
        break;
    if(r == -1){
        break;
    }
    res += r;
}
B230400
The docs are contradictory. cfsetspeed is documented as requiring a speed_t type, while the note says you need to use one of the "B" constants like "B230400." Have you tried using an actual speed_t type?
In any case, the speed you're supplying is the baud rate, which in this case should get you approximately 23,000 bytes/second, assuming there is no throttling.
The speed is dependent on hardware and link limitations. Also the serial protocol allows pausing the transmission.
FWIW, according to the time and speed you listed, if everything works perfectly, you'll get about 1 MB in 50 seconds. What speed are you actually getting?
Another "also" is the options structure. It's been years since I've had to do any serial I/O, but IIRC, you need to actually set the options that you want and are supported by your hardware, like CTS/RTS, XON/XOFF, etc.
As I'm sending more than the buffer can handle at once it is send in small parts and I'm draining the buffer in between.
You have only provided code snippets (rather than a minimal, complete, and verifiable example), so your data size is unknown.
But the Linux kernel buffer size is known. What do you think it is?
(FYI it's 4KB.)
If I measure the time needed for the drain and the number of bytes written to the buffer I calculate a Baudrate nearly 10 times lower than the specified Baudrate.
You're confusing throughput with baudrate.
The maximum throughput (of just payload) of an asynchronous serial link will always be less than the baudrate due to framing overhead per character, which could be two of the ten bits of the frame (assuming 8N1). Since your termios configuration is incomplete, the overhead could actually be three of the eleven bits of the frame (assuming 8N2).
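For example, at the question's 230400 baud, the best-case payload throughput with 8N1 framing is 230400/10 = 23040 bytes/s, and with 8N2 it drops to 230400/11 ≈ 20945 bytes/s.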
In order to achieve the maximum throughput, the transmitting UART must saturate the line with frames and never let the line go idle.
The userspace program must be able to supply data fast enough, preferably by one large write() to reduce syscall overhead.
Did I miss something while setting the UART or while writing?
With Linux, you have limited access to the UART hardware.
From userspace your program accesses a serial terminal.
Your program accesses the serial terminal in a suboptimal manner.
Your termios configuration appears to be incomplete (a sketch of a more complete setup follows this list).
It leaves both hardware and software flow-control untouched.
The number of stop bits is untouched.
The Ignore modem control lines and Enable receiver flags are not enabled.
For raw reading, the VMIN and VTIME values are not assigned.
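A minimal sketch, assuming a raw 8N1 link at the question's 230400 baud with flow control disabled, of what a more complete blocking-mode configuration could look like (open_uart is a made-up helper name, not the poster's code):
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

int open_uart(const char *path)
{
    int fd = open(path, O_RDWR | O_NOCTTY);   /* blocking mode */
    if (fd < 0)
        return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) {
        close(fd);
        return -1;
    }

    cfmakeraw(&tio);                  /* raw mode: no canonical processing   */
    cfsetspeed(&tio, B230400);        /* input and output baud rate          */

    tio.c_cflag |= (CLOCAL | CREAD);  /* ignore modem lines, enable receiver */
    tio.c_cflag &= ~CSTOPB;           /* one stop bit (8N1)                  */
    tio.c_cflag &= ~CRTSCTS;          /* no hardware flow control            */
    tio.c_iflag &= ~(IXON | IXOFF | IXANY); /* no software flow control      */

    tio.c_cc[VMIN]  = 1;              /* blocking read: wait for >= 1 byte   */
    tio.c_cc[VTIME] = 0;

    if (tcsetattr(fd, TCSANOW, &tio) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}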
Or is this normal?
There are ways to easily speed up the transfer.
First, your program combines non-blocking mode with non-canonical mode. That's a degenerate combination for receiving, and suboptimal for transmitting.
You have provided no reason for using non-blocking mode, and your program is not written to properly utilize it.
Therefore your program should be revised to use blocking mode instead of non-blocking mode.
Second, the tcdrain() between write() syscalls can introduce idle time on the serial link. Use of blocking mode eliminates the need for this delay tactic between write() syscalls.
In fact with blocking mode only one write() syscall should be needed to transmit the entire dataLength. This would also minimize any idle time introduced on the serial link.
Note that the first write() does not check its return value for an error condition, which is always possible.
Bottom line: your program would be simpler and throughput would be improved by using blocking I/O.
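As a rough sketch (assuming the descriptor was opened in blocking mode, i.e. without O_NDELAY), the transmit path could be reduced to a single loop around write() with proper error checking; uart_send is a made-up helper name:
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <termios.h>
#include <unistd.h>

/* Write the whole buffer through a blocking descriptor; write() sleeps
 * whenever the kernel's output buffer is full, so no tcdrain() is needed
 * between calls. */
ssize_t uart_send(int fd, const uint8_t *data, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t r = write(fd, data + sent, len - sent);
        if (r < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;             /* real error                     */
        }
        sent += (size_t)r;
    }
    tcdrain(fd);                   /* optional: wait until the last byte leaves the UART */
    return (ssize_t)sent;
}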

DMA memcpy operation in Linux

I want to perform DMA using the dma_async_memcpy_buf_to_buf function, which is in dmaengine.c (linux/drivers/dma). For this, I added a function to dmatest.c (linux/drivers/dma) as follows:
void foo ()
{
    int index = 0;
    dma_cookie_t cookie;
    size_t len = 0x20000;
    ktime_t start, end, end1, end2, end3;
    s64 actual_time;
    u16* dest;
    u16* src;

    dest = kmalloc(len, GFP_KERNEL);
    src = kmalloc(len, GFP_KERNEL);
    for (index = 0; index < len/2; index++)
    {
        dest[index] = 0xAA55;
        src[index] = 0xDEAD;
    }

    start = ktime_get();
    cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
    while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
    {
        dma_sync_wait(chan, cookie);
    }
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution dma: %lld\n",(long long)actual_time);

    memset(dest, 0, len);

    start = ktime_get();
    memcpy(dest, src, len);
    end = ktime_get();
    actual_time = ktime_to_ns(ktime_sub(end, start));
    printk("Time taken for function() execution non-dma: %lld\n",(long long)actual_time);
}
There are some issues with DMA:
Interestingly, the memcpy execution time is less than that of dma_async_memcpy_buf_to_buf. Maybe it's related to a problem with ktime_get().
Is my approach in the foo function correct for performing a DMA operation? I'm not sure about this.
How can I measure the tick counts of memcpy and dma_async_memcpy_buf_to_buf in terms of CPU usage?
Finally, is a DMA operation possible at application level? Up to now I have used it at kernel level, as you can see above (dmatest.c is an inserted kernel module).
There are multiple issues in your question, which make it kind of hard to answer exactly what you're asking:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference between using plain memcpy and DMA operations for copying memory is not direct performance gains, but (a) performance gains from preserving the CPU cache / prefetcher state when using a DMA operation (which would likely be garbled when using plain old memcpy, executed on the CPU itself), and (b) true background operation, which leaves the CPU available to do other stuff.
Given (a), it's kind of pointless to use DMA operations on anything smaller than the CPU cache size, which can be dozens of megabytes. Typically it's done for fast off-CPU stream processing, i.e. moving data that would be produced/consumed by external devices anyway, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall-clock elapsed time is wrong. There might be hundreds of threads / processes running, and no one guarantees that you'd get scheduled on the next tick rather than several thousand ticks later.
Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for such short jobs. Profiling kernel code is in fact a pretty hard and complex task that is well beyond the scope of this question. A quick recommendation here would be to refrain from such micro-benchmarks altogether and profile a much bigger and more complete job - similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations at application level is fairly pointless - at least I can't come up with a single viable scenario off the top of my head where it would be worth the trouble. It's not innately faster, and, what's more important, I seriously doubt that your application's performance bottleneck is memory copying. For that to be the case, you generally should be doing everything else faster than regular memory copying, and I can't really think of anything at application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock frequency and timings. You aren't going to get any miracle boost over regular memcpy in direct performance, because memcpy executed on the CPU is already fast enough, as the CPU usually runs at clock frequencies 3x-10x higher than the memory.

independent searches on GPU -- how to synchronize its finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs), until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    while (lessThan100truesFound) {
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();
    }
}
How should I implement this without severe performance loss, given that
triggering of the "if" will be a very rare event and so it is not a problem if all threads have to wait while reportTrue() is executed
lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way)
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.
I assume something similar to Java's "synchronized" modifier can be done, but I fail to find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?
I'd suggest not using a global flag to stop the kernel, but rather running the kernel for a certain number of attempts, checking on the host whether you have accumulated enough 'successes', and repeating if necessary. Using a loop of undefined length in the kernel is bad since the GPU driver could be killed by the watchdog timer. Besides, checking some global variable at each iteration would certainly hurt kernel performance.
This way, reportTrue() can be implemented as an atomic_inc on a counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned int *successes) {
    // 32-bit counter: atomic_inc works on uint; a 64-bit counter would need
    // atom_inc and the cl_khr_int64_base_atomics extension
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    for (int i = 0; i < ATT_PER_THREAD; ++i) {
        if (generateRandomNumbersAndTestThem())
            atomic_inc(successes);
    }
}
ATT_PER_THREAD is to be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there is usually no need to make your kernel run for more than 0.1--1 seconds.
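For completeness, a host-side sketch in plain C with the OpenCL API (the question uses JOCL, but the structure is the same): run batches until at least 100 successes have accumulated. queue, sampleKernel, successBuf, globalSize and ATT_PER_THREAD are assumed to be set up already, and the names are illustrative.
/* Host-side batching sketch: the device counter is zeroed once, each batch
 * adds globalSize * ATT_PER_THREAD attempts, and the loop stops when the
 * accumulated number of successes reaches 100. */
cl_uint successes = 0;
cl_ulong totalAttempts = 0;
cl_uint zero = 0;

clEnqueueWriteBuffer(queue, successBuf, CL_TRUE, 0, sizeof(zero), &zero, 0, NULL, NULL);

while (successes < 100) {
    clEnqueueNDRangeKernel(queue, sampleKernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, successBuf, CL_TRUE, 0, sizeof(successes), &successes,
                        0, NULL, NULL);            /* blocking read acts as a sync point */
    totalAttempts += (cl_ulong)globalSize * ATT_PER_THREAD;
}

/* estimate of p (the last batch may overshoot 100 slightly) */
double p_estimate = (double)successes / (double)totalAttempts;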

Linux Kernel: udelay() returns too early?

I have a driver which requires microsecond delays. To create this delay, my driver is using the kernel's udelay function. Specifically, there is one call to udelay(90):
iowrite32(data, addr + DATA_OFFSET);
iowrite32(trig, addr + CONTROL_OFFSET);
udelay(30);
trig |= 1;
iowrite32(trig, addr + CONTROL_OFFSET);
udelay(90); // This is the problematic call
We had reliability issues with the device. After a lot of debugging, we traced the problem to the driver resuming before 90us has passed. (See "proof" below.)
I am running kernel version 2.6.38-11-generic SMP (Kubuntu 11.04, x86_64) on an Intel Pentium Dual Core (E5700).
As far as I know, the documentation states that udelay will delay execution for at least the specified delay, and is uninterruptible. Is there a bug in this version of the kernel, or did I misunderstand something about the use of udelay?
To convince ourselves that the problem was caused by udelay returning too early, we fed a 100kHz clock to one of the I/O ports and implemented our own delay as follows:
// Wait until n number of falling edges
// are observed
void clk100_delay(void *addr, u32 n) {
    int i;
    for (i = 0; i < n; i++) {
        u32 prev_clk = ioread32(addr);
        while (1) {
            u32 clk = ioread32(addr);
            if (prev_clk && !clk) {
                break;
            } else {
                prev_clk = clk;
            }
        }
    }
}
...and the driver now works flawlessly.
As a final note, I found a discussion indicating that frequency scaling could be causing the *delay() family of functions to misbehave, but this was on an ARM mailing list - I assumed such problems would be non-existent on an x86-based Linux PC.
I don't know of any bug in that version of the kernel (but that doesn't mean that there isn't one).
udelay() isn't "uninterruptible" - it does not disable preemption, so your task can be preempted by an RT task during the delay. However, the same is true of your alternate delay implementation, so that is unlikely to be the problem.
Could your actual problem be a DMA coherency / memory ordering issue? Your alternate delay implementation accesses the bus, so this might be hiding the real problem as a side-effect.
The E5700 has X86_FEATURE_CONSTANT_TSC but not X86_FEATURE_NONSTOP_TSC. The TSC is the likely clock source for the udelay. Unless bound to one of the cores with an affinity mask, your task may have been preempted and rescheduled to another CPU during the udelay. Or the TSC might not be stable during lower-power CPU modes.
Can you try disabling interrupts or disabling preemption during the udelay? Also, try reading the TSC before and after.
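A minimal sketch of that experiment, assuming it runs inside the driver (get_cycles() reads the TSC on x86; the 90 comes from the problematic call in the question, and udelay_probe is a made-up name):
#include <linux/delay.h>
#include <linux/irqflags.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <asm/timex.h>

/* Run udelay(90) with preemption and local interrupts disabled, and read the
 * cycle counter before and after, so an early return or a migration to
 * another core shows up in the measured cycle count. */
static void udelay_probe(void)
{
    unsigned long flags;
    cycles_t before, after;

    preempt_disable();
    local_irq_save(flags);

    before = get_cycles();
    udelay(90);
    after = get_cycles();

    local_irq_restore(flags);
    preempt_enable();

    pr_info("udelay(90) took %llu cycles\n",
            (unsigned long long)(after - before));
}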

vmsplice() and TCP

In the original vmsplice() implementation, it was suggested that if you had a user-land buffer 2x the maximum number of pages that could fit in a pipe, a successful vmsplice() on the second half of the buffer would guarantee that the kernel was done using the first half of the buffer.
But that was not true after all, and particularly for TCP, the kernel pages would be kept until an ACK was received from the other side. Fixing this was left as future work, and thus for TCP, the kernel would still have to copy the pages from the pipe.
vmsplice() has the SPLICE_F_GIFT option that sort-of deals with this, but the problem is that this exposes two other problems - how to efficiently get fresh pages from the kernel, and how to reduce cache thrashing. The first issue is that mmap requires the kernel to clear the pages, and the second issue is that although mmap might use the fancy kscrubd feature in the kernel, that increases the working set of the process (cache thrashing).
Based on this, I have these questions:
What is the current state for notifying userland about the safe re-use of pages? I am especially interested in pages splice()d onto a socket (TCP). Did anything happen during the last 5 years?
Is mmap / vmsplice / splice / munmap the current best practice for zero-copying in a TCP server or have we better options today?
Yes, because the TCP socket holds on to the pages for an indeterminate time, you cannot use the double-buffering scheme mentioned in the example code. Also, in my use case the pages come from a circular buffer, so I cannot gift the pages to the kernel and allocate fresh ones. I can verify that I am seeing data corruption in the received data.
I resorted to polling the level of the TCP socket's send queue until it drains to 0. This fixes data corruption but is suboptimal because draining the send queue to 0 affects throughput.
n = ::vmsplice(mVmsplicePipe.fd.w, &iov, 1, 0);
while (n) {
    // splice pipe to socket
    m = ::splice(mVmsplicePipe.fd.r, NULL, mFd, NULL, n, 0);
    n -= m;
}
while (1) {
    int outsize = 0;
    int result;
    usleep(20000);
    result = ::ioctl(mFd, SIOCOUTQ, &outsize);
    if (result == 0) {
        LOG_NOISE("outsize %d", outsize);
    } else {
        LOG_ERR_PERROR("SIOCOUTQ");
        break;
    }
    //if (outsize <= (bufLen >> 1)) {
    if (outsize == 0) {
        LOG("outsize %d <= %u", outsize, bufLen >> 1);
        break;
    }
};
