independent searches on GPU -- how to synchronize its finish?

independent searches on GPU -- how to synchronize its finish? - multithreading

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferrably of multiple GPUs), until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
int gid = get_global_id(0);
//here code that localizes all_the_input for faster access
while (lessThan100truesFound) {
totAttempts[gid]++;
if (generateRandomNumbersAndTestThem())
reportTrue();
}
}
How should I implement this without severe performance loss, given that
triggering of the "if" will be a very rare event and so it is not a problem if all threads have to wait while reportTrue() is executed
lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way)
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.
I assume that something can be done similar to Java's "synchronized" modifier, but I fail to find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?

I'd suggest not using global flag to stop kernel, but rather run kernel to do certain amount of attempts, check on host if you have accumulated enough 'successes', and repeat if necessary. Using cycle of undefined length in kernel is bad since GPU driver could be killed by watch-dog timer. Besides, checking some global variable at each iteration would certainly screw kernel performance.
This way, reportTrue could be implemented as atomic_inc to some counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned long *successes) {
int gid = get_global_id(0);
//here code that localizes all_the_input for faster access
for (int i = 0; i < ATT_PER_THREAD; ++i) {
if (generateRandomNumbersAndTestThem())
atomic_inc(successes);
}
}
ATT_PER_THREAD is to be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there usually is no need to make your kernel run more than 0.1--1 second

Related

Is it safe to attempt (and fail) to write to a const on an STM32?

So, we are experimenting with an approach to perform some matrix math. This is embedded, so memory is limited, and we will have large matrices so it helps us to keep some of them stored in flash rather than RAM.
I've written a matrix structure, two arrays (one const/flash and the other RAM), and a "modify" and "get" function. One matrix, I initialize to the RAM data, and the other matrix I initialize to the flash data, using a cast from const *f32 to *f32.
What I find is that when I run this code on my STM32 embedded processor, the RAM matrix is modifiable, and the matrix pointing to the flash data simply doesn't change (the set to 12.0 doesn't "take", the value remains 2.0).
(before change) a=2, b=2, (after change) c=2, d=12
This is acceptable behavior, by design we will not attempt to modify matrices of flash data, but if we make a mistake we don't want it to crash.
If I run the same code on my windows machine with Visual C++, however, I get an "access violation" when I attempt to run the code below, when I try to modify the const array to 12.0.
This is not surprising that Windows would object, but I'd like to understand the difference in behavior better. This seems related to CPU architecture. Is it safe, on our STM32, to let the code attempt to write to a const array and let it have no effect? Or are there side effects, or reasons to avoid this?
static const f32 constarray[9] = {1,2,3,1,2,3,1,2,3};
static f32 ramarray[9] = {1,2,3,1,2,3,1,2,3};
typedef struct {
u16 rows;
u16 cols;
f32 * mat;
} matrix_versatile;
void modify_versatile_matrix(matrix_versatile * m, uint16_t r, uint16_t c, double new_value)
{
m->mat[r * m->cols + c] = new_value;
}
double get_versatile_matrix_value(matrix_versatile * m, uint16_t r, uint16_t c)
{
return m->mat[r * m->cols + c];
}
double a;
double b;
double c;
double d;
int main(void)
{
matrix_versatile matrix_with_const_data;
matrix_versatile matrix_with_ram_data;
matrix_with_const_data.cols = 3;
matrix_with_const_data.rows = 3;
matrix_with_const_data.mat = (f32 *) constarray;
matrix_with_ram_data.cols = 3;
matrix_with_ram_data.rows = 3;
matrix_with_ram_data.mat = ramarray;
a = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
b = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);
modify_versatile_matrix(&matrix_with_const_data, 1, 1, 12.0);
modify_versatile_matrix(&matrix_with_ram_data, 1, 1, 12.0);
c = get_versatile_matrix_value(&matrix_with_const_data, 1, 1);
d = get_versatile_matrix_value(&matrix_with_ram_data, 1, 1);

but if we make a mistake we don't want it to crash.
Attempting to write to ROM will not in itself cause a crash, but the code attempting to write it is by definition buggy and may crash in any case, and will certainly not behave as intended.
It is almost entirely wrong thinking; if you have a bug, you really want it to crash during development, and not after deployment. If it silently does the wrong thing, you may never notice the bug, or the crash might occur somewhere other than in proximity of the bug, so be very hard to find.
Architectures an MMU or MPU may issue an exception if you attempt to write to memory marked as read-only. That is what is happening in Windows. In that case it can be a useful debug aid given an exception handler that reports such errors by some means. In this case the error is reported exactly when it occurs, rather than crashing some time later when some invalid data is accessed or incorrect result acted upon.
Some, but mot all STM32 parts include the MPU (application note)

The answer may depend on the series (STM32F1, STM32F4, STM32L1 etc), as they have somewhat different flash controllers.
I've once made the same mistake on an STM32F429, and investigated a bit, so I can tell what would happen on an STM32F4.
Probably nothing.
The flash is by default protected, in order to be somewhat resilient to those kind of programming errors. In order to modify the flash, one has to write certain values to the FLASH->KEYR register. If the wrong value is written, then the flash will be locked until reset, so nothing really bad can happen unless the program writes 64 bits of correct values. No unexpected interrupts can happen, because the interrupt enable bit is protected by this key too. The attempt will set some error bits in FLASH->SR, so a program can check it and warn the user (preferably the tester).
However if there is some code there (e.g. a bootloader, or logging something into flash) that is supposed to write something in the flash, i.e. it unlocks the flash with the correct keys, then bad things can happen.
If the flash is left unlocked after a preceding write operation, then writing to a previously programmed area will change bits from 1 to 0, but not from 0 to 1. It means that the flash will contain the bitwise AND of the old and the newly written value.
If the failed write attempt occurs first, and unlocked afterwards, then no legitimate write or erase operation would succeed unless the status bits are properly cleared first.
If the intended and unintended accesses occur interleaved, e.g. in interrupt handlers, then all bets are off.
Even if the values are in immutable flash memory, there can still be unexpected result. Consider this code
int foo(int *array) {
array[0] = 1;
array[1] = 3;
array[2] = 5;
return array[0];
}
An optimizing compiler might recognize that the return value should always be 1, and emit code to that effect. Or it might not, and reload array[0] from wherever it is stored, possibly a different value from flash. It may behave differently in debug and release builds, or when the function is called from different places, as it might be inlined differently.
If the pointer points to an unmapped area, neither RAM nor FLASH nor some memory mapped register, then a a fault will occur, and as the default fault handlers contain just an infinite loop, the program will hang unless it has a fault handler installed that can deal with the situation. Needless to say, overwriting random RAM areas or registers can result in barely predictable behaviour.
UPDATE
I've tried your code on actual hardware. When I ran it verbatim, the compiler (gcc-arm-none-eabi-7-2018-q2-update -O3 -lto) optimized away everything, since the variables were not used afterwards. Marking a, b, c, d as volatile resulted in c=2 and d=12, it was still considering the first array const, and no accesses to the arrays were generated. constarray did not show up in the map file at all, the linker had eliminated it completely.
So I've tried a few things one at a time to force the optimizer to generate code that would actually access the arrays.
Disablig optimization (-O0)
Making all variables volatile
Inserting a couple of compile-time memory barriers (asm volatile("":::"memory");
Doing some complex calculations in the middle
Any of these has produced varying effects on different MCUs, but they were always consistent on a single platform.
STM32F103: Hard Fault. Only halfword (16 bit) write accesses are allowed to the flash, 8 or 32 bits always result in a fault. When I've changed the data types to short, the code ran, of course without any effect on the flash.
STM32F417: Code runs, with no effects on the flash contents, but bits 6 and 7, PGPERR and PGSERR in FLASH->SR were set a few cycles after the first write attempt to constarray.
STM32L151: Code runs, with no effects on the flash controller status.

OpenCL - how to effectively distribute work items to different devices

I'm writing an openCL application where I have N work items that I want to distribute to D devices where N > D and in turn each device can process the elements of its own work item in parallel and thus achieve a sort of "double" parallelism.
Here is the code I have written already to try and achieve this.
First I create a an event for each of my devices and set them all to complete:
cl_int err;
cl_event *events = new cl_event[deviceCount];
for(int i = 0; i < deviceCount; i++)
{
events[i] = clCreateUserEvent(context, &err);
events[i] = clSetUserEventStatus(events[i], CL_COMPLETE);
}
Each device also has its own command queue and its own "instance" of a kernel.
Then I enter into my "main loop" for distributing work items. The code finds the first available device and enqueues it with the work item.
/*---Loop over all available jobs---*/
for(int i = 0; i < numWorkItems; i++)
{
WorkItem item = workItems[i];
bool found = false; //Check for device availability
int index = -1; //Index of found device
while(!found) //Continuously loop until free device is found.
{
for(int j = 0; j < deviceCount; j++) //Total number of CPUs + GPUs
{
cl_int status;
err = clGetEventInfo(events[j], CL_EVENT_COMMAND_EXECUTION_STATUS, sizeof(cl_int), &status, NULL);
if(status == CL_COMPLETE) /*Current device has completed all of its tasks*/
{
found = true; //Exit infinite loop
index = j; //Choose current device
break; //Break out of inner loop
}
}
}
//Enqueue my kernel
clSetKernelArg(kernels[index], 0, sizeof(cl_mem), &item);
clEnqueueNDRangeKernel(queues[index], kernels[index], 1, NULL, &glob, &loc, 0, NULL, &events[index]);
clFlush(commandQueues[index]);
}
And then finally I wrap up by calling clFinish on all my devices:
/*---Wait For Completion---*/
for(int i = 0; i < deviceCount; i++)
{
clFinish(queues[i]);
}
This approach has a few problems however:
1) It doesn't distribute the work to all my devices. On my current computer I have 3 devices. My algorithm above only distributes the work to devices 1 and 2. Device 3 always gets left out because devices 1 and 2 finish so quickly that they can snatch up more work items before 3 gets a chance.
2) Even with devices 1 and 2 running together, I only see a very, very mild speed increase. For instance if i were to assign all work items to device 1 it might take 10 seconds to complete, and if I assign all work items to device 2 it might take 11 seconds to complete, but if I try to split the work between them, combined it might take 8-9 seconds when what I would hope for might be between 4-5 seconds. I get the feeling that they might not really be running in parallel with each other the way I want.
How do I fix these issues?

You have to be careful with the sizes and the memory location. Typically these factors are not considered when dealing with GPU devices. I would ask you:
What are the kernel sizes?
How fast do they finish?
If the kernel size is small and they finish quite quickly. Then the overhead of launching them will be high. So the finer granularity of distributing them across many devices does not overcome the extra overhead. In that case is better to directly increase the work size and use 1 device only.
Are the kernels independent? Do they use different buffers?
Another important thing is to have completely different memory for each device, otherwise the memory trashing between devices will delay the kernel launches, and in that case 1 single device (holding all the memory buffers locally) will perform better.
OpenCL will copy to a device all the buffers a kernel uses, and will "block" all the kernels (even in other devices) that use buffers that the kernel is writing to; will wait it to finish and then copy the buffer back to the other device.
Is the host a bottleneck?
The host is sometimes not as fast as you may think, and sometimes the kernels run so fast that the host is a big bottleneck scheduling jobs to them.
If you use the CPU as a CL device, then it cannot do both tasks (act as host and run kernels). You should prefer always GPU devices rather than CPU devices when scheduling kernels.
Never let a device empty
Waiting till a device has finish the execution, before queuing more work is typically a very bad idea. You should queue preemptively kernels in advance (1 or 2) even before the current kernel has finished. Otherwise, the device utilization will not reach not even 80%. Since there is a big amount of time since the kernel finishes till the hosts realizes of it, and even a bigger amount of time until the host queues more data to the kernel (typically >2ms, for a 10ms kernel, thats 33% wasted).
I would do:
Change this line to submitted jobs: if(status >= CL_SUBMITTED)
Ensure the devices are ordered GPU -> CPU. So, the GPUs are the device 0,1 and CPU is the device 2.
Try removing the CPU device (only using the GPUs). Maybe the speed is even better.

DMA memcpy operation in Linux

I want to dma using dma_async_memcpy_buf_to_buf function which is in dmaengine.c file (linux/drivers/dma). For this, I add a function in dmatest.c file (linux/drivers/dma) as following:
void foo ()
{
int index = 0;
dma_cookie_t cookie;
size_t len = 0x20000;
ktime_t start, end, end1, end2, end3;
s64 actual_time;
u16* dest;
u16* src;
dest = kmalloc(len, GFP_KERNEL);
src = kmalloc(len, GFP_KERNEL);
for (index = 0; index < len/2; index++)
{
dest[index] = 0xAA55;
src[index] = 0xDEAD;
}
start = ktime_get();
cookie = dma_async_memcpy_buf_to_buf(chan, dest, src, len);
while (dma_async_is_tx_complete(chan, cookie, NULL, NULL) == DMA_IN_PROGRESS)
{
dma_sync_wait(chan, cookie);
}
end = ktime_get();
actual_time = ktime_to_ns(ktime_sub(end, start));
printk("Time taken for function() execution dma: %lld\n",(long long)actual_time);
memset(dest, 0 , len);
start = ktime_get();
memcpy(dest, src, len);
end = ktime_get();
actual_time = ktime_to_ns(ktime_sub(end, start));
printk("Time taken for function() execution non-dma: %lld\n",(long long)actual_time);
}
There are some issues with DMA:
Interestingly, memcpy function execution time is less than dma_async_memcpy_buf_to_buf function. Maybe, its related with ktime_get() function problem.
My method with foo function is correct or incorrect to perform DMA operation? I'm not sure about this.
How can I measure tick counts of memcpy and dma_async_memcpy_buf_to_buf functions in terms of cpu usage
Finally, Is DMA operation possible at application level? Up to now I used in kernel level as you can see above(dmatest.c is inserted kernel module)

There are multiple issues in your question, which make it kind of hard to answer exactly what you're questioning:
Yes, your general DMA operation invocation algorithm is correct.
The fundamental difference in using plain memcpy and DMA operations for copying memory is not getting direct performance gains, but (a) performance gains due to sustaining CPU cache / prefetcher state when using DMA operation (which is likely would be garbled when using plain old memcpy, executed on CPU itself), and (b) true background operation which leaves CPU available to do other stuff.
Given (a), it's kind of pointless to use DMA operations on anything less than CPU cache size, i.e. dozens of megabytes. Typically it's done for purposes of fast off-CPU stream processing, i.e. moving data that would be anyhow produced/consumed by external devices, such as fast networking cards, video streaming / capturing / encoding hardware, etc.
Comparing async and sync operations in terms of wall clock elapsed time is wrong. There might be hundreds of threads / processes running and no one guarantees you that you'd get scheduled next tick and not several thousands ticks later.
Using ktime_get for benchmarking purposes is wrong - it's fairly imprecise, especially for given such short jobs. Profiling kernel code in fact is a pretty hard and complex task which is well beyond the scope of this question. A quick recommendation here would be to refrain at all from such micro-benchmarks and profile a much bigger and more complete job - similar to what you're ultimately trying to achieve.
Measuring "ticks" for modern CPUs is also kind of pointless, although you can use CPU vendor-specific tools, such as Intel's VTune.
Using DMA copy operations on application level is fairly pointless - at least I can't come with a single viable scenario from top of my head when it would be worth the trouble. It's not innately faster, and, what's more important, I seriously doubt that your application performance's bottleneck is memory copying. For this to be the case, you generally should be doing everything else faster than regular memory copying, and I can't really think of anything on application level that would be faster than memcpy. And if we're talking about communication with some other, off-CPU processing device, then it's automatically not application level.
Generally, memory copy performance is usually limited by memory speed, i.e. clock freq and timings. You aren't going to get any miracle boosts over regular memcpy in direct performance, just because memcpy executed on CPU is fast enough, as CPU usually works with 3x-5x-10x faster clock frequencies than memory.

malloc/realloc/free capacity optimization

When you have a dynamically allocated buffer that varies its size at runtime in unpredictable ways (for example a vector or a string) one way to optimize its allocation is to only resize its backing store on powers of 2 (or some other set of boundaries/thresholds), and leave the extra space unused. This helps to amortize the cost of searching for new free memory and copying the data across, at the expense of a little extra memory use. For example the interface specification (reserve vs resize vs trim) of many C++ stl containers have such a scheme in mind.
My question is does the default implementation of the malloc/realloc/free memory manager on Linux 3.0 x86_64, GLIBC 2.13, GCC 4.6 (Ubuntu 11.10) have such an optimization?
void* p = malloc(N);
... // time passes, stuff happens
void* q = realloc(p,M);
Put another way, for what values of N and M (or in what other circumstances) will p == q?

From the realloc implementation in glibc trunk at http://sources.redhat.com/git/gitweb.cgi?p=glibc.git;a=blob;f=malloc/malloc.c;h=12d2211b0d6603ac27840d6f629071d1c78586fe;hb=HEAD
First, if the memory has been obtained via mmap() instead of sbrk(), which glibc malloc does for large requests, >= 128 kB by default IIRC:
if (chunk_is_mmapped(oldp))
{
void* newmem;
#if HAVE_MREMAP
newp = mremap_chunk(oldp, nb);
if(newp) return chunk2mem(newp);
#endif
/* Note the extra SIZE_SZ overhead. */
if(oldsize - SIZE_SZ >= nb) return oldmem; /* do nothing */
/* Must alloc, copy, free. */
newmem = public_mALLOc(bytes);
if (newmem == 0) return 0; /* propagate failure */
MALLOC_COPY(newmem, oldmem, oldsize - 2*SIZE_SZ);
munmap_chunk(oldp);
return newmem;
}
(Linux has mremap(), so in practice this is what is done).
For smaller requests, a few lines below we have
newp = _int_realloc(ar_ptr, oldp, oldsize, nb);
where _int_realloc is a bit big to copy-paste here, but you'll find it starting at line 4221 in the link above. AFAICS, it does NOT do the constant factor optimization increase that e.g. the C++ std::vector does, but rather allocates exactly the amount requested by the user (rounded up to the next chunk boundaries + alignment stuff and so on).
I suppose the idea is that if the user wants this factor of 2 size increase (or any other constant factor increase in order to guarantee logarithmic efficiency when resizing multiple times), then the user can implement it himself on top of the facility provided by the C library.

Perhaps you can use malloc_usable_size (google for it) to find the answer experimentally. This function, however, seems undocumented, so you will need to check out if it is still available at your platform.
See also How to find how much space is allocated by a call to malloc()?

Is there a way to check whether the processor cache has been flushed recently?

On i386 linux. Preferably in c/(c/posix std libs)/proc if possible. If not is there any piece of assembly or third party library that can do this?
Edit: I'm trying to develop test whether a kernel module clear a cache line or the whole proccesor(with wbinvd()). Program runs as root but I'd prefer to stay in user space if possible.

Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>
inline void
clflush(volatile void *p)
{
asm volatile ("clflush (%0)" :: "r"(p));
}
inline uint64_t
rdtsc()
{
unsigned long a, d;
asm volatile ("rdtsc" : "=a" (a), "=d" (d));
return a | ((uint64_t)d << 32);
}
volatile int i;
inline void
test()
{
uint64_t start, end;
volatile int j;
start = rdtsc();
j = i;
end = rdtsc();
printf("took %lu ticks\n", end - start);
}
int
main(int ac, char **av)
{
test();
test();
printf("flush: ");
clflush(&i);
test();
test();
return 0;
}

I dont know of any generic command to get the the cache state, but there are ways:
I guess this is the easiest: If you got your kernel module, just disassemble it and look for cache invalidation / flushing commands (atm. just 3 came to my mind: WBINDVD, CLFLUSH, INVD).
You just said it is for i386, but I guess you dont mean a 80386. The problem is that there are many different with different extension and features. E.g. the newest Intel series has some performance/profiling registers for the cache system included, which you can use to evalute cache misses/hits/number of transfers and similar.
Similar to 2, very depending on the system you got. But when you have a multiprocessor configuration you could watch the first cache coherence protocol (MESI) with the 2nd.
You mentioned WBINVD - afaik that will always flush complete, i.e. all, cache lines

It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, by e.g. moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist and that will be probably affected by your mere asking about it - yes, Heisenberg was way before his time :-)

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string