Why I can't use global float constants in device code? [duplicate] - visual-c++

I am using CUDA 5.0. I noticed that the compiler will allow me to use host-declared int constants within kernels. However, it refuses to compile any kernels that use host-declared float constants. Does anyone know the reason for this seeming discrepancy?
For example, the following code runs just fine as is, but it will not compile if the final line in the kernel is uncommented.
#include <cstdio>
#include <cuda_runtime.h>
static int __constant__ DEV_INT_CONSTANT = 1;
static float __constant__ DEV_FLOAT_CONSTANT = 2.0f;
static int const HST_INT_CONSTANT = 3;
static float const HST_FLOAT_CONSTANT = 4.0f;
__global__ void uselessKernel(float * val)
{
*val = 0.0f;
// Use device int and float constants
*val += DEV_INT_CONSTANT;
*val += DEV_FLOAT_CONSTANT;
// Use host int and float constants
*val += HST_INT_CONSTANT;
//*val += HST_FLOAT_CONSTANT; // won't compile if uncommented
}
int main(void)
{
float * d_val;
cudaMalloc((void **)&d_val, sizeof(float));
uselessKernel<<<1, 1>>>(d_val);
cudaFree(d_val);
}

Adding a const number in the device code is OK, but adding a number stored on the host memory in the device code is NOT.
Every reference of the static const int in your code can be replaced with the value 3 by the compiler/optimizer when the addr of that variable is never referenced. In this case, it is like #define HST_INT_CONSTANT 3, and no host memory is allocated for this variable.
But for float var, the host memory is always allocated even it is of static const float. Since the kernel can not access the host memory directly, your code with static const float won't be compiled.
For C/C++, int can be optimized more aggressively than float.
You code runs when the comment is ON can be seen as a bug of CUDA C I think. The static const int is a host side thing, and should not be accessible to the device directly.

Related

ff_replay substructure in ff_effect empty

I am developing a force feedback driver (linux) for a yet unsupported gamepad.
Whenever a application in userspace requests a ff-effect (e.g rumbling), a function in my driver is called:
static int foo_ff_play(struct input_dev *dev, void *data, struct ff_effect *effect)
this is set by the following code inside my init function:
input_set_capability(dev, EV_FF, FF_RUMBLE);
input_ff_create_memless(dev, NULL, foo_ff_play);
I'm accessing the ff_effect struct (which is passed to my foo_ff_play function) like this:
static int foo_ff_play(struct input_dev *dev, void *data, struct ff_effect *effect)
{
u16 length;
length = effect->replay.length;
printk(KERN_DEBUG "length: %i", length);
return 0;
}
The problem is, that the reported length (in ff_effect->replay) is always zero.
That's confusing, since i am running fftest on my device, and fftest definitely sets the length attribute: https://github.com/flosse/linuxconsole/blob/master/utils/fftest.c (line 308)
/* a strong rumbling effect */
effects[4].type = FF_RUMBLE;
effects[4].id = -1;
effects[4].u.rumble.strong_magnitude = 0x8000;
effects[4].u.rumble.weak_magnitude = 0;
effects[4].replay.length = 5000;
effects[4].replay.delay = 1000;
Does this have something to do with the "memlessness"? Why does the data in ff_replay seem to be zero if it isn't?
Thank you in advance
Why is the replay struct empty?
Taking a look at https://elixir.free-electrons.com/linux/v4.4/source/drivers/input/ff-memless.c#L406 we find:
static void ml_play_effects(struct ml_device *ml)
{
struct ff_effect effect;
DECLARE_BITMAP(handled_bm, FF_MEMLESS_EFFECTS);
memset(handled_bm, 0, sizeof(handled_bm));
while (ml_get_combo_effect(ml, handled_bm, &effect))
ml->play_effect(ml->dev, ml->private, &effect);
ml_schedule_timer(ml);
}
ml_get_combo_effect sets the effect by calling ml_combine_effects., but ml_combine_effects simply does not copy replay.length to the ff_effect struct which is passed to our foo_play_effect (at least not if the effect-type is FF_RUMBLE): https://elixir.free-electrons.com/linux/v4.4/source/drivers/input/ff-memless.c#L286
That's why we cannot read out the ff_replay-data in our foo_play_effect function.
Okay, replay is empty - how can we determine how long we have to play the effect (e.g. FF_RUMBLE) then?
Looks like the replay structure is something we do not even need to carry about. Yes, fftest sets the length and then uploads the effect to the driver, but if we take a look at ml_ff_upload (https://elixir.free-electrons.com/linux/v4.4/source/drivers/input/ff-memless.c#L481), we can see the following:
if (test_bit(FF_EFFECT_STARTED, &state->flags)) {
__clear_bit(FF_EFFECT_PLAYING, &state->flags);
state->play_at = jiffies +
msecs_to_jiffies(state->effect->replay.delay);
state->stop_at = state->play_at +
msecs_to_jiffies(state->effect->replay.length);
state->adj_at = state->play_at;
ml_schedule_timer(ml);
}
That means that the duration is already handled by the input-subsystem. It starts the effect and also stops it as needed.
Furthermore we can see at https://elixir.free-electrons.com/linux/v4.4/source/include/uapi/linux/input.h#L279 that
/*
* All duration values are expressed in ms. Values above 32767 ms (0x7fff)
* should not be used and have unspecified results.
*/
That means that we have to make our effect play at least 32767ms. Everything else (stopping the effect before) is up to the scheduler - which is not our part :D

DLL Floating point results differ according to caller

This is a follow up question to my earlier one asked yesterday
The problems were occurring in a MSVS 2008 C++ DLL that has over 4000 lines of code, but I have managed to produce a simple case that demonstrates the problem as it occurs on my CPU (an AMD Phenom II X6 1050T).
Will it show the problem occurring on another system? I'd really like to know!
Here is a simple class (Point.cpp), it needs to be compiled as a DLL:
#include <math.h>
#define EXPORT extern "C" __declspec(dllexport)
namespace Test {
struct Point {
double x;
double y;
/* Constructor for a Point object */
Point(double xx, double yy) : x(xx), y(yy) {}
/* Copy constructor */
Point(const Point &rhs) : x(rhs.x), y(rhs.y) {}
double mag() const;
Point norm() const;
};
double Point::mag() const {return sqrt(x*x + y*y);}
Point Point::norm() const {
double m = mag();
return Point(x/m, y/m);
}
EXPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny)
Point P = Point(x, y);
Point N = P.norm();
*nx = N.x;
*ny = N.y;
}
}
Here is the test program (TestPoint.c), which needs to be linked to the lib created for the DLL:
#include <stdio.h>
#define IMPORT extern __declspec(dllimport)
IMPORT void __stdcall GetNorm(double x, double y, double *nx, double *ny);
void dhex(double x) { // double to hex
union {
unsigned long n[2];
double d;
} value;
value.d = x;
printf("(0x%0x%0x)\n", value.n[1], value.n[0]);
}
double i64tod(unsigned long long n) { // hex to double
double *DP = (double *) &n;
return *DP;
}
int main(int argc, char **argv) {
double vx, vy;
double ux, uy;
vx = i64tod(0xbfc7a30f3a53d351);
vy = i64tod(0xc01b578b34e3ce1d);
GetNorm(vx, vy, &ux, &uy);
printf(" vx = %20.18f ", vx); dhex(vx);
printf(" vy = %20.18f ", vy); dhex(vy);
printf("\n");
printf(" ux = %20.18f ", ux); dhex(ux);
printf(" uy = %20.18f ", uy); dhex(uy);
return 0;
}
On my system, with TestPoint compiled with VC++, the output is:
vx = -0.18466368053455054 (0xbfc7a30f3a53d351)
vy = -6.8354919685403077 (0xc01b578b34e3ce1d)
ux = -0.027005566159023012 (0xbf9ba758ddda1454,
uy = -0.99963528318903927 (0xbfeffd032227301b)
However, if the same code is compiled with gcc, or indeed, it seems, ANY equivalent program (eg VB6, PowerBasic), the results (ux and uy) are subtly but definitely different (the last hex digit):
vx = -0.184663680534550540 (0xbfc7a30f3a53d351)
vy = -6.835491968540307700 (0xc01b578b34e3ce1d)
ux = -0.027005566159023008 (0xbf9ba758ddda1453)
uy = -0.999635283189039160 (0xbfeffd032227301a)
This might seem an insignificant difference, but when it occurs in a physics engine, these differences accumulate in an alarming fashion. .
If the engine is going to get different results depending on who calls it I might have to abandon the use of VC++ altogether and try g++ instead.
Ok, I think I know how this happens. Looking at a disassembler listing of Point.dll, I noticed that the GetNorm function was pretty much what you'd expect, a couple of FMUL's and FDIV's. What was not present was an FLDCW instruction.
There weren't any FLDCW's in the MSVC calling program either, but I found FLDCW's in both the gcc and a PowerBasic versions of the calling program.
So I tweaked one of the executables (the PowerBasic EXE was the easiest to find the right place to tweak), and hey presto, I then got answers that matched MSVC. Presumably the FLDCW had changed the FPU rounding mode, hence the difference in the least significant bits.

Why does this code works well? It changes the Constant storage area string;

this is a simple linux kernel module code to reverse a string which should Oops after insmod,but it works well,why?
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
static char *words = "words";
static int __init words_init(void)
{
printk(KERN_INFO "debug info\n");
int len = strlen(words);
int k;
for ( k = 0; k < len/2; k++ )
{
printk("the value of k %d\n",k);
char a = words[k];
words[k]= words[len-1-k];
words[len-1-k]=a;
}
printk(KERN_INFO "words is %s\n", words);
return 0;
}
static void __exit words_exit(void)
{
printk(KERN_INFO "words exit.\n");
}
module_init(words_init);
module_exit(words_exit);
module_param(words, charp, S_IRUGO);
MODULE_LICENSE("GPL");
static != const.
static in your example means "only visible in the current file". See What does "static" mean?
const just means "the code which can see this declaration isn't supposed to change the variable". But in certain cases, you can still create a non-const pointer to the same address and modify the contents.
Space removal from a String - In Place C style with Pointers suggests that writing to the string value shouldn't be possible.
Since static is the only difference between your code and that of the question, I looked further and found this: Global initialized variables declared as "const" go to text segment, while those declared "Static" go to data segment. Why?
Try static const char *word, that should do the trick.
Haha!,I get the answer from the linux kernel source code by myself;When you use the insmod,it will call the init_moudle,load_module,strndup_usr then memdup_usr function;the memdup_usr function will use kmalloc_track_caller to alloc memery from slab and then use the copy_from_usr to copy the module paragram into kernel;this mean the linux kernel module paragram store in heap,not in the constant storage area!! So we can change it's content!

CUDA - Array Generating random array on gpu and its modification using kernel

in this code im generating 1D array of floats on a gpu using CUDA. The numbers are between 0 and 1. For my purpose i need them to be between -1 and 1 so i have made simple kernel to multiply each element by 2 and then substract 1 from it. However something is going wrong here. When i print my original array into .bmp i get this http://i.imgur.com/IS5dvSq.png (typical noise pattern). But when i try to modify that array with my kernel i get blank black picture http://imgur.com/cwTVPTG . The program is executable but in the debug i get this:
First-chance exception at 0x75f0c41f in Midpoint_CUDA_Alpha.exe:
Microsoft C++ exception: cudaError_enum at memory location
0x003cfacc..
First-chance exception at 0x75f0c41f in Midpoint_CUDA_Alpha.exe:
Microsoft C++ exception: cudaError_enum at memory location
0x003cfb08..
First-chance exception at 0x75f0c41f in Midpoint_CUDA_Alpha.exe:
Microsoft C++ exception: [rethrow] at memory location 0x00000000..
i would be thankfull for any help or even little hint in this matter. Thanks !
(edited)
#include <device_functions.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include "stdafx.h"
#include "EasyBMP.h"
#include <curand.h> //curand.lib must be added in project propetties > linker > input
#include "device_launch_parameters.h"
float *heightMap_cpu;
float *randomArray_gpu;
int randCount = 0;
int rozmer = 513;
void createRandoms(int size){
curandGenerator_t generator;
cudaMalloc((void**)&randomArray_gpu, size*size*sizeof(float));
curandCreateGenerator(&generator,CURAND_RNG_PSEUDO_XORWOW);
curandSetPseudoRandomGeneratorSeed(generator,(int)time(NULL));
curandGenerateUniform(generator,randomArray_gpu,size*size);
}
__global__ void polarizeRandoms(int size, float *randomArray_gpu){
int index = threadIdx.x + blockDim.x * blockIdx.x;
if(index<size*size){
randomArray_gpu[index] = randomArray_gpu[index]*2.0f - 1.0f;
}
}
//helper fucnction for getting address in 1D using 2D coords
int ad(int x,int y){
return x*rozmer+y;
}
void printBmp(){
BMP AnImage;
AnImage.SetSize(rozmer,rozmer);
AnImage.SetBitDepth(24);
int i,j;
for(i=0;i<=rozmer-1;i++){
for(j=0;j<=rozmer-1;j++){
AnImage(i,j)->Red = (int)((heightMap_cpu[ad(i,j)]*127)+128);
AnImage(i,j)->Green = (int)((heightMap_cpu[ad(i,j)]*127)+128);
AnImage(i,j)->Blue = (int)((heightMap_cpu[ad(i,j)]*127)+128);
AnImage(i,j)->Alpha = 0;
}
}
AnImage.WriteToFile("HeightMap.bmp");
}
int main(){
createRandoms(rozmer);
polarizeRandoms<<<((rozmer*rozmer)/1024)+1,1024>>>(rozmer,randomArray_gpu);
heightMap_cpu = (float*)malloc((rozmer*rozmer)*sizeof(float));
cudaMemcpy(heightMap_cpu,randomArray_gpu,rozmer*rozmer*sizeof(float),cudaMemcpyDeviceToHost);
printBmp();
//cleanup
cudaFree(randomArray_gpu);
free(heightMap_cpu);
return 0;
}
This is wrong:
cudaMalloc((void**)&randomArray_gpu, size*size*sizeof(float));
We don't use cudaMalloc with __device__ variables. If you do proper cuda error checking I'm pretty sure that line will throw an error.
If you really want to use a __device__ pointer this way, you need to create a separate normal pointer, cudaMalloc that, then copy the pointer value to the device pointer using cudaMemcpyToSymbol:
float *my_dev_pointer;
cudaMalloc((void**)&my_dev_pointer, size*size*sizeof(float));
cudaMemcpyToSymbol(randomArray_gpu, &my_dev_pointer, sizeof(float *));
Whenever you are having trouble with your CUDA programs, you should do proper cuda error checking. It will likely focus your attention on what is wrong.
And, yes, kernels can access __device__ variables without the variable being passed explicitly as a parameter to the kernel.
The programming guide covers the proper usage of __device__ variables and the api functions that should be used to access them from the host.

C++/CLI from tracking reference to (native) reference - wrapping

I need a C# interface to call some native C++ code via the CLI dialect. The C# interface uses the out attribute specifier in front of the required parameters. That translates to a % tracking reference in C++/CLI.
The method I has the following signature and body (it is calling another native method to do the job):
virtual void __clrcall GetMetrics(unsigned int %width, unsigned int %height, unsigned int %colourDepth, int %left, int %top) sealed
{
mRenderWindow->getMetrics(width, height, colourDepth, left, top);
}
Now the code won't compile because of a few compile time errors (all being related to not being able to convert parameter 1 from 'unsigned int' to 'unsigned int &').
As a modest C++ programmer, to me CLI is looking like Dutch to a German speaker. What can be done to make this wrapper work properly in CLI?
Like it was also suggested in a deleted answer, I did the obvious and used local variables to pass the relevant values around:
virtual void __clrcall GetMetrics(unsigned int %width, unsigned int %height, unsigned int %colourDepth, int %left, int %top) sealed
{
unsigned int w = width, h = height, c = colourDepth;
int l = left, t = top;
mRenderWindow->getMetrics(w, h, c, l, t);
width = w; height = h; colourDepth = c; left = l; top = t;
}
It was a bit obvious since the rather intuitive mechanism of tracked references: they're affected by the garbage collector's work and are not really that static/constant as normal &references when they're prone to be put somewhere else in memory. Thus this is the only way reliable enough to overcome the issue. Thanks to the initial answer.
If your parameters use 'out' on the C# side, you need to define your C++/CLI parameters like this: [Out] unsigned int ^%width
Here's an example:
virtual void __clrcall GetMetrics([Out] unsigned int ^%width)
{
width = gcnew UInt32(42);
}
Then on your C# side, you'll get back 42:
ValueType vt;
var res = cppClass.GetMetrics(out vt);
//vt == 42
In order to use the [Out] parameter on the C++/CLI side you'll need to include:
using namespace System::Runtime::InteropServices;
Hope this helps!
You can use pin_ptr so that 'width' doesn't move when native code changes it. The managed side suffers from pin_ptr, but I don't think you can get around that if you want native code directly access it without 'w'.
virtual void __clrcall GetMetrics(unsigned int %width, unsigned int %height, unsigned int %colourDepth, int %left, int %top) sealed
{
pin_ptr<unsigned int> pw = &width; //do the same for height
mRenderWindow->getMetrics(*pw, h, c, l, t);
}

Resources