I've run into an issue with OpenCL. I'm currently writing a raytracer in CL. The problem I've hit is that I have a lot of vertices and modifying them is extremely inefficient.
This is how my current kernel looks.
__kernel void raytraceTheme(__global int* width, __global int* height,
                            __global Triangle* triangles, __global int* triangleCount,
                            __global float4* renderBuffer);
Let's say that I have 100K triangles and they take around 98MB of RAM on my GPU. To update/delete/add some vertices I literally have to release all the vertex data in OpenCL, modify it on the CPU, then upload all of it again. This is a ridiculously slow process.
The solution I came up with to fix this, without limiting how many triangles I can have, is to chunk up my data and create a pointer array so I can access the chunks while raytracing. If I run out of space in triangleStorage, I can simply allocate another chunk (which won't take too long anyway). Something like the following.
__kernel void uploadTriangle(__global Triangle** triangleStorage,
                             __global Triangle* triangle, __global int* index)
{
    triangleStorage[*index] = triangle;
}
__kernel void raytraceTheme(__global Triangle** triangleStorage,
                            __global int* chunkNum......)
But after some googling and experimenting, I found that OpenCL does not support double pointers in kernel arguments. I'm stuck and have no idea how to work around this.
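To illustrate what I'm after, here is a sketch of one workaround I'm considering (just my own idea, and the names like slotToOffset are placeholders): keep a single flat Triangle buffer plus an index table instead of a pointer table, so no double pointer is needed.

__kernel void uploadTriangle(__global Triangle* triangleStorage,      /* one big flat buffer              */
                             __global int* slotToOffset,              /* index table, replaces Triangle** */
                             __global const Triangle* triangle,
                             __global const int* slot,
                             __global const int* offset)
{
    triangleStorage[*offset] = *triangle;   /* store the triangle itself       */
    slotToOffset[*slot] = *offset;          /* remember where this slot lives  */
}

The raytrace kernel would then read triangleStorage[slotToOffset[i]]. Alternatively the host could update just a sub-range of the existing buffer with clEnqueueWriteBuffer and a non-zero offset, without releasing and re-uploading everything.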
I read the book "Linux Kernel Development" and found some functions that confused me, listed below:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
void __free_pages(struct page *page, unsigned int order)
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)
My questions are about the use of the two underscores in the function names, and how the functions pair up.
1. When does the Linux kernel use two underscores in a function name?
2. Why is alloc_pages paired with __free_pages, and not with free_pages?
As you can see:
alloc_pages() / __free_pages() take a "struct page *" (page descriptor) as argument.
They are usually used internally by infrastructure kernel code, such as the page fault handler, which wants to manipulate the page descriptor rather than the contents of the memory block.
__get_free_pages() / free_pages() take an "unsigned long" (the virtual address of the memory block) as argument.
They can be used by code that wants to use the memory block itself; after allocation you can read from / write to this memory block.
As for the names and the double underscore "__", you don't need to worry too much. Sometimes kernel functions were named casually, without much consideration, when they were first written. By the time people realized the names were not ideal, the functions were already used widely in the kernel, and nobody bothered to change them.
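To make the pairing concrete, here is a minimal sketch (my own illustration, not from the book) of how each pair is typically used inside kernel code:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static void demo_page_apis(void)
{
    /* Pair 1: work with the page descriptor (struct page *). */
    struct page *page = alloc_pages(GFP_KERNEL, 2);    /* 2^2 = 4 contiguous pages */
    if (page) {
        void *vaddr = page_address(page);              /* map descriptor to an address if needed */
        memset(vaddr, 0, 4 * PAGE_SIZE);
        __free_pages(page, 2);                         /* free by page descriptor */
    }

    /* Pair 2: work directly with the virtual address (unsigned long). */
    unsigned long addr = __get_free_pages(GFP_KERNEL, 2);
    if (addr) {
        memset((void *)addr, 0, 4 * PAGE_SIZE);        /* use the memory block itself */
        free_pages(addr, 2);                           /* free by address */
    }
}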
The timeline generated by the Nsight Visual Profiler looks very strange. I haven't written any transfer-overlapping code, but you can see overlap between MemCpy and compute kernels.
This makes me unable to debug the real overlapping code.
I use CUDA 5.0, a Tesla M2090, CentOS 6.3, and 2x Xeon E5-2609 CPUs.
Has anyone had a similar problem? Does it occur only on certain Linux distributions? How can I fix it?
This is the code.
#include <cuda.h>
#include <curand.h>
#include <cublas_v2.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/device_ptr.h>
int main()
{
    cublasHandle_t hd;
    curandGenerator_t rng;
    cublasCreate(&hd);
    curandCreateGenerator(&rng, CURAND_RNG_PSEUDO_MTGP32);

    const size_t m = 5000, n = 1000;
    const double alpha = 1.0;
    const double beta = 0.0;

    thrust::host_vector<double> h(n * m, 0.1);
    thrust::device_vector<double> a(m * n, 0.1);
    thrust::device_vector<double> b(n * m, 0.1);
    thrust::device_vector<double> c(m * m, 0.1);
    cudaDeviceSynchronize();

    // Every operation below is followed by cudaDeviceSynchronize(),
    // so nothing should overlap in the timeline.
    for (int i = 0; i < 10; i++)
    {
        // Compute kernel: fill a with uniform random doubles.
        curandGenerateUniformDouble(rng,
            thrust::raw_pointer_cast(&a[0]), a.size());
        cudaDeviceSynchronize();

        // Host-to-device memcpy: copy h into b.
        thrust::copy(h.begin(), h.end(), b.begin());
        cudaDeviceSynchronize();

        // Compute kernel: c = a * b.
        cublasDgemm(hd, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, m, n, &alpha,
                    thrust::raw_pointer_cast(&a[0]), m,
                    thrust::raw_pointer_cast(&b[0]), n,
                    &beta,
                    thrust::raw_pointer_cast(&c[0]), m);
        cudaDeviceSynchronize();
    }

    curandDestroyGenerator(rng);
    cublasDestroy(hd);
    return 0;
}
This is the profile timeline that was captured.
Compute Capability 2.x (Fermi) devices are capable of both kernel-level concurrency and kernel/copy concurrency. In order to trace concurrent kernels, the kernel start and end timestamps are collected in a different clock domain from the memory copy timestamps. The tool is responsible for correlating these different clocks. In your screenshot I believe there is a scaling-factor difference (bad correlation): you can see that each memory copy is not off by a constant value but by a scaled offset.
If you use the option --concurrent-kernels off in nvprof, I think the problem will disappear. When concurrent kernels are disabled, the memory copy and kernel timing use the same source clock for timestamps.
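For example, from the command line (the exact invocation here is just an illustration; the option is what matters):

nvprof --concurrent-kernels off ./your_app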
Compute Capability 3.x (Kepler) and 5.x (Maxwell) devices have a different mechanism for timing compute kernels. For these devices it is possible in the tools to see overlap between the end timestamp of one kernel and the start of a memory copy or another kernel. The work does not actually overlap. There was a design decision in the tools between showing this potential overlap (usually <500 ns) and introducing it as a constant overhead between dependent work. The tools chose to avoid introducing the overhead, at the cost of potentially showing a very small amount of overlap on serialized work.
I have a simple question: when I have to copy a structure's contents from user space to kernel space, for example with an ioctl call (or vice versa) (for simplicity the code has no error checking):
typedef struct my_struct{
int a;
char b;
} my_struct;
Userspace:
my_struct s;
s.a = 11;
s.b = 'X';
ioctl(fd, MY_CMD, &s);
Kernelspace:
int my_ioctl(struct inode *inode, struct file *filp, unsigned int cmd,
unsigned long arg)
{
...
my_struct ks;
copy_from_user(&ks, (void __user *)arg, sizeof(ks));
...
}
I think that the size of the structure in user space (variable s) and in kernel space (variable ks) might not be the same (without specifying __attribute__((packed))). So is it right to specify the number of bytes in copy_from_user with the sizeof macro? I see that in the kernel sources there are structures that are not declared as packed, so how is it ensured that the size will be the same in user space and kernel space?
Thank you all!
Why should the layout of a struct be different in kernel space from user space? There is no reason for the compiler to lay out the data differently.
The exception is if userspace is a 32-bit program running on a 64-bit kernel. See http://www.x86-64.org/pipermail/discuss/2002-June/002614.html for a tutorial on how to deal with this.
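For illustration (my example, not from the linked tutorial), a struct containing long or pointer members changes size between a 32-bit userspace build and a 64-bit kernel, while fixed-width types with explicit padding keep the layout identical on both sides:

/* 32-bit userspace: sizeof == 8  (long is 4 bytes)
 * 64-bit kernel:    sizeof == 16 (long is 8 bytes, plus padding after a) */
struct bad_abi {
    int  a;
    long b;
};

/* Same size and member offsets on both sides. */
#include <linux/types.h>        /* __u32/__u64; use <stdint.h> in userspace */
struct good_abi {
    __u32 a;
    __u32 pad;                  /* explicit padding instead of compiler-inserted padding */
    __u64 b;
};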
The userspace structure should come from a kernel header, so the struct definition should be the same in user and kernel space. Do you have a real example?
Of course, if you play with different packing options on the two sides of an ABI, whatever it is, you are in trouble. The problem here is not sizeof.
If your question is: do packing options affect the binary interface, the answer is yes.
If your question is: how can I solve a packing mismatch, please provide more information.
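As a small illustration of how a packing mismatch breaks the ABI (my example, not part of the answer above):

struct s  { char b; int a; };                          /* sizeof == 8: 3 padding bytes after b */
struct sp { char b; int a; } __attribute__((packed));  /* sizeof == 5: no padding */

/* If one side of the interface compiles the packed variant and the other the
 * unpacked one, the member offsets disagree; sizeof is "correct" on each side,
 * but copy_from_user()/copy_to_user() then shuffles the wrong bytes. */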
Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.
I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs) until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed.
For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I'm looking to do this on GPUs. But I'm struggling a bit with how to implement the stop condition properly. My idea was to have something along these lines as the kernel:
__kernel void sampleKernel(all_the_input, __global unsigned long *totAttempts) {
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    while (lessThan100truesFound) {
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();
    }
}
How should I implement this without severe performance loss, given that
triggering of the "if" will be a very rare event and so it is not a problem if all threads have to wait while reportTrue() is executed
lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don't even know if a boolean is the right way)
the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.
I assume that something similar to Java's "synchronized" modifier can be done, but I can't find the exact way to do it. What is the "right" way to do this, i.e. any way that works without severe performance loss?
I'd suggest not using a global flag to stop the kernel, but rather running the kernel for a certain number of attempts, checking on the host whether you have accumulated enough 'successes', and repeating if necessary. Using a loop of undefined length in a kernel is bad, since the GPU driver could be killed by the watchdog timer. Besides, checking some global variable on each iteration would certainly hurt kernel performance.
This way, reportTrue could be implemented as an atomic_inc on a counter residing in global memory.
__kernel void sampleKernel(all_the_input, __global unsigned int *successes) {
    int gid = get_global_id(0);
    //here code that localizes all_the_input for faster access
    for (int i = 0; i < ATT_PER_THREAD; ++i) {
        if (generateRandomNumbersAndTestThem())
            atomic_inc(successes);  // 32-bit atomic; a 64-bit counter would need
                                    // the cl_khr_int64_base_atomics extension (atom_inc)
    }
}
ATT_PER_THREAD should be adjusted depending on how long it takes to execute generateRandomNumbersAndTestThem(). Kernel launch overhead is pretty small, so there is usually no need to make your kernel run for more than 0.1--1 seconds.
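A minimal host-side sketch of that launch/check/repeat loop (my addition; it assumes the queue, kernel and a cl_mem counter buffer already exist, and names like clSuccesses, globalSize and ATT_PER_THREAD are placeholders):

/* Repeat fixed-size batches until at least 100 successes have been observed. */
cl_uint successes = 0;
unsigned long long totalAttempts = 0;
while (successes < 100) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    /* Blocking read also acts as the synchronization point for this batch. */
    clEnqueueReadBuffer(queue, clSuccesses, CL_TRUE, 0, sizeof(successes),
                        &successes, 0, NULL, NULL);
    totalAttempts += (unsigned long long)globalSize * ATT_PER_THREAD;
}
double p_estimate = (double)successes / (double)totalAttempts;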
On i386 Linux. Preferably in C (C/POSIX standard libraries) or via /proc if possible. If not, is there any piece of assembly or a third-party library that can do this?
Edit: I'm trying to develop a test of whether a kernel module clears one cache line or the whole processor's cache (with wbinvd()). The program runs as root, but I'd prefer to stay in user space if possible.
Cache coherent systems do their utmost to hide such things from you. I think you will have to observe it indirectly, either by using performance counting registers to detect cache misses or by carefully measuring the time to read a memory location with a high resolution timer.
This program works on my x86_64 box to demonstrate the effects of clflush. It times how long it takes to read a global variable using rdtsc. Being a single instruction tied directly to the CPU clock makes direct use of rdtsc ideal for this.
Here is the output:
took 81 ticks
took 81 ticks
flush: took 387 ticks
took 72 ticks
You see 3 trials: The first ensures i is in the cache (which it is, because it was just zeroed as part of BSS), the second is a read of i that should be in the cache. Then clflush kicks i out of the cache (along with its neighbors) and shows that re-reading it takes significantly longer. A final read verifies it is back in the cache. The results are very reproducible and the difference is substantial enough to easily see the cache misses. If you cared to calibrate the overhead of rdtsc() you could make the difference even more pronounced.
If you can't read the memory address you want to test (although even mmap of /dev/mem should work for these purposes) you may be able to infer what you want if you know the cacheline size and associativity of the cache. Then you can use accessible memory locations to probe the activity in the set you're interested in.
Source code:
#include <stdio.h>
#include <stdint.h>

/* Evict the cache line containing p from every cache level. */
static inline void
clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r"(p));
}

/* Read the CPU timestamp counter. */
static inline uint64_t
rdtsc()
{
    unsigned long a, d;
    asm volatile ("rdtsc" : "=a" (a), "=d" (d));
    return a | ((uint64_t)d << 32);
}

volatile int i;

/* Time a single read of the global i. */
static inline void
test()
{
    uint64_t start, end;
    volatile int j;
    start = rdtsc();
    j = i;
    end = rdtsc();
    printf("took %lu ticks\n", end - start);
}

int
main(int ac, char **av)
{
    test();
    test();
    printf("flush: ");
    clflush(&i);
    test();
    test();
    return 0;
}
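If you wanted to calibrate the rdtsc() overhead mentioned above, a back-to-back measurement (my addition, not part of the original program) gives you a baseline to subtract from the figures test() prints:

/* Cost of the timing itself: two rdtsc() calls with nothing in between. */
static inline uint64_t
rdtsc_overhead(void)
{
    uint64_t start = rdtsc();
    uint64_t end = rdtsc();
    return end - start;
}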
I don't know of any generic command to get the cache state, but there are ways:
1. I guess this is the easiest: if you have your kernel module, just disassemble it and look for cache invalidation / flushing instructions (just three come to mind at the moment: WBINVD, CLFLUSH, INVD).
2. You said it is for i386, but I guess you don't mean an 80386. The problem is that there are many different CPUs with different extensions and features. E.g. the newest Intel series has performance/profiling registers for the cache system included, which you can use to evaluate cache misses/hits/number of transfers and the like.
3. Similar to 2, and very dependent on the system you have: when you have a multiprocessor configuration, you could watch the first CPU's cache-coherence protocol (MESI) traffic with the second one.
You mentioned WBINVD - AFAIK that will always flush everything, i.e. all cache lines.
It may not be an answer to your specific question, but have you tried using a cache profiler such as Cachegrind? It can only be used to profile userspace code, but you might be able to use it nonetheless, e.g. by moving the code of your function to userspace if it does not depend on any kernel-specific interfaces.
It might actually be more effective than trying to ask the processor for information that may or may not exist, and that will probably be affected by your merely asking about it - yes, Heisenberg was way ahead of his time :-)