CUDA: allocation of an array of structs inside a struct - struct

I've these structs:
typedef struct neuron
{
float* weights;
int n_weights;
}Neuron;
typedef struct neurallayer
{
Neuron *neurons;
int n_neurons;
int act_function;
}NLayer;
"NLayer" struct can contain an arbitrary number of "Neuron"
I've tried to allocate a 'NLayer' struct with 5 'Neurons' from the host in this way:
NLayer* nL;
int i;
int tmp=9;
cudaMalloc((void**)&nL,sizeof(NLayer));
cudaMalloc((void**)&nL->neurons,6*sizeof(Neuron));
for(i=0;i<5;i++)
cudaMemcpy(&nL->neurons[i].n_weights,&tmp,sizeof(int),cudaMemcpyHostToDevice);
...then I've tried to modify the "nL->neurons[0].n_weights" variable with that kernel:
__global__ void test(NLayer* n)
{
n->neurons[0].n_weights=121;
}
but at compiling time nvcc returns that "warning" related to the only line of the kernel:
Warning: Cannot tell what pointer points to, assuming global memory space
and when the kernel finish its work the struct begin unreachable.
It's very probably that I'm doing something wrong during the allocation....can someone helps me??
Thanks very much, and sorry for my english! :)
UPDATE:
Thanks to aland I've modified my code creating this function that should allocate an instance of the struct "NLayer":
NLayer* setNLayer(int numNeurons,int weightsPerNeuron,int act_fun)
{
int i;
NLayer h_layer;
NLayer* d_layer;
float* d_weights;
//SET THE LAYER VARIABLE OF THE HOST NLAYER
h_layer.act_function=act_fun;
h_layer.n_neurons=numNeurons;
//ALLOCATING THE DEVICE NLAYER
if(cudaMalloc((void**)&d_layer,sizeof(NLayer))!=cudaSuccess)
puts("ERROR: Unable to allocate the Layer");
//ALLOCATING THE NEURONS ON THE DEVICE
if(cudaMalloc((void**)&h_layer.neurons,numNeurons*sizeof(Neuron))!=cudaSuccess)
puts("ERROR: Unable to allocate the Neurons of the Layer");
//COPING THE HOST NLAYER ON THE DEVICE
if(cudaMemcpy(d_layer,&h_layer,sizeof(NLayer),cudaMemcpyHostToDevice)!=cudaSuccess)
puts("ERROR: Unable to copy the data layer onto the device");
for(i=0;i<numNeurons;i++)
{
//ALLOCATING THE WEIGHTS' ARRAY ON THE DEVICE
cudaMalloc((void**)&d_weights,weightsPerNeuron*sizeof(float));
//COPING ITS POINTER AS PART OF THE i-TH NEURONS STRUCT
if(cudaMemcpy(&d_layer->neurons[i].weights,&d_weights,sizeof(float*),cudaMemcpyHostToDevice)!=cudaSuccess)
puts("Error: unable to copy weights' pointer to the device");
}
//RETURN THE DEVICE POINTER
return d_layer;
}
and i call that function from the main in that way (the kernel "test" is previously declared):
int main()
{
NLayer* nL;
int h_tmp1;
float h_tmp2;
nL=setNLayer(10,12,13);
test<<<1,1>>>(nL);
if(cudaMemcpy(&h_tmp1,&nL->neurons[0].n_weights,sizeof(float),cudaMemcpyDeviceToHost)!=cudaSuccess);
puts("ERROR!!");
printf("RESULT:%d",h_tmp1);
}
When I compile that code the compiler show me the Warning, and when I execute the program it print on screen:
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
Error: unable to copy weights' pointer to the device
ERROR!!
RESULT:1
The last error doesn't not compare if I comment the kernel call.
Where I'm wrong?
I do not know how to do
Thanks for your help!

The problem is here:
cudaMalloc((void**)&nL,sizeof(NLayer));
cudaMalloc((void**)&nL->neurons,6*sizeof(Neuron));
In first line, nL is pointing to structure in global memory on device.
Therefore, in second line the first argument to cudaMalloc is address residing on GPU, which is undefined behaviour (on my test system, it causes segfault; in your case, though, there is something more subtle).
The correct way to do what you want is first to create structure in host memory, fill it with data, and then copy it to device, like this:
NLayer* nL;
NLayer h_nL;
int i;
int tmp=9;
// Allocate data on device
cudaMalloc((void**)&nL, sizeof(NLayer));
cudaMalloc((void**)&h_nL.neurons, 6*sizeof(Neuron));
// Copy nlayer with pointers to device
cudaMemcpy(nL, &h_nL, sizeof(NLayer), cudaMemcpyHostToDevice);
Also, don't forget to always check for any errors from CUDA routines.
UPDATE
In second version of your code:
cudaMemcpy(&d_layer->neurons[i].weights,&d_weights,...) --- again, you are dereferencing device pointer (d_layer) on host. Instead, you should use
cudaMemcpy(&h_layer.neurons[i].weights,&d_weights,sizeof(float*),cudaMemcpyHostToDevice
Here you take h_layer (host structure), read its element (h_layer.neurons), which is pointer to device memory. Then you do some pointer arithmetics on it (&h_layer.neurons[i].weights). No access to device memory is needed to compute this address.

It all depends on the GPU card your using. The Fermi card uses uniform addressing of shared and global memory space, while pre-Fermi cards don't.
For the pre-Fermi case, you don't know if the address should be shared or global. The compiler can usually figure this out, but there are cases where it can't. When a pointer to shared memory is required, you usually take an address of a shared variable and the compiler can recognise this. The message "assuming global" will appear when this is not explicitly defined.
If you are using a GPU that has compute capabiilty of 2.x or higher, it should work with the -arch=sm_20 compiler flag

Related

Function for unaligned memory access on ARM

I am working on a project where data is read from memory. Some of this data are integers, and there was a problem accessing them at unaligned addresses. My idea would be to use memcpy for that, i.e.
uint32_t readU32(const void* ptr)
{
uint32_t n;
memcpy(&n, ptr, sizeof(n));
return n;
}
The solution from the project source I found is similar to this code:
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
So my questions:
Isn't the second version undefined behaviour? The C standard says in 6.2.3.2 Pointers, at 7:
A pointer to an object or incomplete type may be converted to a pointer to a different
object or incomplete type. If the resulting pointer is not correctly aligned 57) for the
pointed-to type, the behavior is undefined.
As the calling code has, at some point, used a char* to handle the memory, there must be some conversion from char* to uint32_t*. Isn't the result of that undefined behaviour, then, if the uint32_t* is not corrently aligned? And if it is, there is no point for the function as you could write *(uint32_t*) to fetch the memory. Additionally, I think I read somewhere that the compiler may expect an int* to be aligned correctly and any unaligned int* would mean undefined behaviour as well, so the generated code for this function might make some shortcuts because it may expect the function argument to be aligned properly.
The original code has volatile on the argument and all variables because the memory contents could change (it's a data buffer (no registers) inside a driver). Maybe that's why it does not use memcpy since it won't work on volatile data. But, in which world would that make sense? If the underlying data can change at any time, all bets are off. The data could even change between those byte copy operations. So you would have to have some kind of mutex to synchronize access to this data. But if you have such a synchronization, why would you need volatile?
Is there a canonical/accepted/better solution to this memory access problem? After some searching I come to the conclusion that you need a mutex and do not need volatile and can use memcpy.
P.S.:
# cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 10 (v7l)
BogoMIPS : 1581.05
Features : swp half thumb fastmult vfp edsp neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc09
CPU revision : 10
This code
uint32_t readU32(const uint32_t* ptr)
{
union {
uint32_t n;
char data[4];
} tmp;
const char* cp=(const char*)ptr;
tmp.data[0] = *cp++;
tmp.data[1] = *cp++;
tmp.data[2] = *cp++;
tmp.data[3] = *cp;
return tmp.n;
}
passes the pointer as a uint32_t *. If it's not actually a uint32_t, that's UB. The argument should probably be a const void *.
The use of a const char * in the conversion itself is not undefined behavior. Per 6.3.2.3 Pointers, paragraph 7 of the C Standard (emphasis mine):
A pointer to an object type may be converted to a pointer to a
different object type. If the resulting pointer is not correctly
aligned for the referenced type, the behavior is undefined.
Otherwise, when converted back again, the result shall compare
equal to the original pointer. When a pointer to an object is
converted to a pointer to a character type, the result points to the
lowest addressed byte of the object. Successive increments of the
result, up to the size of the object, yield pointers to the remaining
bytes of the object.
The use of volatile with respect to the correct way to access memory/registers directly on your particular hardware would have no canonical/accepted/best solution. Any solution for that would be specific to your system and beyond the scope of standard C.
Implementations are allowed to define behaviors in cases where the Standard does not, and some implementations may specify that all pointer types have the same representation and may be freely cast among each other regardless of alignment, provided that pointers which are actually used to access things are suitably aligned.
Unfortunately, because some obtuse compilers compel the use of "memcpy" as an
escape valve for aliasing issues even when pointers are known to be aligned,
the only way compilers can efficiently process code which needs to make
type-agnostic accesses to aligned storage is to assume that any pointer of a type requiring alignment will always be aligned suitably for such type. As a result, your instinct that approach using uint32_t* is dangerous is spot on. It may be desirable to have compile-time checking to ensure that a function is either passed a void* or a uint32_t*, and not something like a uint16_t* or a double*, but there's no way to declare a function that way without allowing a compiler to "optimize" the function by consolidating the byte accesses into a 32-bit load that will fail if the pointer isn't aligned.

Getting ENOTTY on ioctl for a Linux Kernel Module

I have the following chardev defined:
.h
#define MAJOR_NUM 245
#define MINOR_NUM 0
#define IOCTL_MY_DEV1 _IOW(MAJOR_NUM, 0, unsigned long)
#define IOCTL_MY_DEV2 _IOW(MAJOR_NUM, 1, unsigned long)
#define IOCTL_MY_DEV3 _IOW(MAJOR_NUM, 2, unsigned long)
module .c
static long device_ioctl(
struct file* file,
unsigned int ioctl_num,
unsigned long ioctl_param)
{
...
}
static int device_open(struct inode* inode, struct file* file)
{
...
}
static int device_release(struct inode* inode, struct file* file)
{
...
}
struct file_operations Fops = {
.open=device_open,
.unlocked_ioctl= device_ioctl,
.release=device_release
};
static int __init my_dev_init(void)
{
register_chrdev(MAJOR_NUM, "MY_DEV", &Fops);
...
}
module_init(my_dev_init);
My user code
ioctl(fd, IOCTL_MY_DEV1, 1);
Always fails with same error: ENOTTY
Inappropriate ioctl for device
I've seen similar questions:
i.e
Linux kernel module - IOCTL usage returns ENOTTY
Linux Kernel Module/IOCTL: inappropriate ioctl for device
But their solutions didn't work for me
ENOTTY is issued by the kernel when your device driver has not registered a ioctl function to be called. I'm afraid your function is not well registered, probably because you have registered it in the .unlocked_ioctl field of the struct file_operations structure.
Probably you'll get a different result if you register it in the locked version of the function. The most probable cause is that the inode is locked for the ioctl call (as it should be, to avoid race conditions with simultaneous read or write operations to the same device)
Sorry, I have no access to the linux source tree for the proper name of the field to use, but for sure you'll be able to find it yourself.
NOTE
I observe that you have used macro _IOW, using the major number as the unique identifier. This is probably not what you want. First parameter for _IOW tries to ensure that ioctl calls get unique identifiers. There's no general way to acquire such identifiers, as this is an interface contract you create between application code and kernel code. So using the major number is bad practice, for two reasons:
Several devices (in linux, at least) can share the same major number (minor allocation in linux kernel allows this) making it possible for a clash between devices' ioctls.
In case you change the major number (you configure a kernel where that number is already allocated) you have to recompile all your user level software to cope with the new device ioctl ids (all of them change if you do this)
_IOW is a macro built a long time ago (long ago from the birth of linux kernel) that tried to solve this problem, by allowing you to select a different character for each driver (but not dependant of other kernel parameters, for the reasons pointed above) for a device having ioctl calls not clashing with another device driver's. The probability of such a clash is low, but when it happens you can lead to an incorrect machine state (you have issued a valid, working ioctl call to the wrong device)
Ancient unix (and early linux) kernels used different chars to build these calls, so, for example, tty driver used 'T' as parameter for the _IO* macros, scsi disks used 'S', etc.
I suggest you to select a random number (not appearing elsewhere in the linux kernel listings) and then use it in all your devices (probably there will be less drivers you write than drivers in the kernel) and select a different ioctl id for each ioctl call. Maintaining a local ioctl file with the registered ioctls this way is far better than trying to guess a value that works always.
Also, a look at the definition of the _IO* macros should be very illustrative :)

Accessing memory pointers in hardware registers

I'm working on enhancing the stock ahci driver provided in Linux in order to perform some needed tasks. I'm at the point of attempting to issue commands to the AHCI HBA for the hard drive to process. However, whenever I do so, my system locks up and reboots. Trying to explain the process of issuing a command to an AHCI drive is far to much for this question. If needed, reference this link for the full discussion (the process is rather defined all over because there are several pieces, however, ch 4 has the data structures necessary).
Essentially, one writes the appropriate structures into memory regions defined by either the BIOS or the OS. The first memory region I should write to is the Command List Base Address contained in the register PxCLB (and PxCLBU if 64-bit addressing applies). My system is 64 bits and so I'm trying to getting both 32-bit registers. My code is essentially this:
void __iomem * pbase = ahci_port_base(ap);
u32 __iomem *temp = (u32*)(pbase + PORT_LST_ADDR);
struct ahci_cmd_hdr *cmd_hdr = NULL;
cmd_hdr = (struct ahci_cmd_hdr*)(u64)
((u64)(*(temp + PORT_LST_ADDR_HI)) << 32 | *temp);
pr_info("%s:%d cmd_list is %p\n", __func__, __LINE__, cmd_hdr);
// problems with this next line, makes the system reboot
//pr_info("%s:%d cl[0]:0x%08x\n", __func__, __LINE__, cmd_hdr->opts);
The function ahci_port_base() is found in the ahci driver (at least it is for CentOS 6.x). Basically, it returns the proper address for that port in the AHCI memory region. PORT_LST_ADDR and PORT_LST_ADDR_HI are both macros defined in that driver. The address that I get after getting both the high and low addresses is usually something like 0x0000000037900000. Is this memory address in a space that I cannot simply dereference it?
I'm hitting my head against the wall at this point because this link shows that accessing it in this manner is essentially how it's done.
The address that I get after getting both the high and low addresses
is usually something like 0x0000000037900000. Is this memory address
in a space that I cannot simply dereference it?
Yes, you are correct - that's a bus address, and you can't just dereference it because paging is enabled. (You shouldn't be just dereferencing the iomapped addresses either - you should be using readl() / writel() for those, but the breakage here is more subtle).
It looks like the right way to access the ahci_cmd_hdr in that driver is:
struct ahci_port_priv *pp = ap->private_data;
cmd_hdr = pp->cmd_slot;

DMA over PCIe to other device

I am trying to access the DMA address in a NIC directly from another PCIe device in Linux. Specifically, I am trying to read that from an NVIDIA GPU to bypass the CPU all together. I have researched for zero-copy networking and DMA to userspace posts, but they either didn't answer the question or involve some copy from Kernel space to User space. I am trying to avoid using any CPU clocks because of the inconsistency with the delay and I have very tight latency requirements.
I got a hold of the NIC driver for the intel card I use (e1000e driver) and I found where the ring buffers are allocated. As I understood from a previous paper I was reading, I would be interested in the descriptor of type dma_addr_t. They also have a member of the rx_ring struct called dma. I pass both the desc and the dma members using an ioctl call but I am unable to get anything in the GPU besides zeros.
The GPU code is as follows:
int *setup_gpu_dma(u64 addr)
{
// Allocate GPU memory
int *gpu_ptr;
cudaMalloc((void **) &gpu_ptr, MEM_SIZE);
// Allocate memory in user space to read the stuff back
int *h_data;
cudaMallocHost((void **)&h_data, MEM_SIZE);
// Present FPGA memory to CUDA as CPU locked pages
int error = cudaHostRegister((void **) &addr, MEM_SIZE,
CU_MEMHOSTALLOC_DEVICEMAP);
cout << "Allocation error = " << error << endl;
// DMA from GPU memory to FPGA memory
cudaMemcpy((void **) &gpu_ptr, (void **)&addr, MEM_SIZE, cudaMemcpyHostToDevice);
cudaMemcpy((void **) &h_data, (void **)&gpu_ptr, MEM_SIZE, cudaMemcpyDeviceToHost);
// Print the data
// Clean up
}
What am I doing wrong?
cudaHostRegister() operates on already-allocated host memory, so you have to pass addr, not &addr.
If addr is not a host pointer, this will not work. If it is a host pointer, your function interface should use void * and then there will be no need for the typecast.

copy_to_user and copy_from_user with structs

I have a simple question: when i have to copy a structure's content from userspace to kernel space for example with an ioctl call (or viceversa) (for simplicity code hasn't error check):
typedef struct my_struct{
int a;
char b;
} my_struct;
Userspace:
my_struct s;
s.a = 11;
s.b = 'X';
ioctl(fd, MY_CMD, &s);
Kernelspace:
int my_ioctl(struct inode *inode, struct file *filp, unsigned int cmd,
unsigned long arg)
{
...
my_struct ks;
copy_from_user(&ks, (void __user *)arg, sizeof(ks));
...
}
i think that size of structure in userspace (variable s) and kernel space (variable ks) could be not the same (without specify the __attribute__((packed))). So is a right thing specifing the number of byte in copy_from_user with sizeof macro? I see that in kernel sources there are some structures that are not declared as packed so, how is ensured the fact that the size will be the same in userspace and kernelspace?
Thank you all!
Why should the layout of a struct be different in kernel space from user space? There is no reason for the compiler to layout data differently.
The exception is if userspace is a 32bit program running on a 64bit kernel. See http://www.x86-64.org/pipermail/discuss/2002-June/002614.html for a tutorial how to deal with this.
The userspace structure should come from kernel header, so struct definition should be the same in user and kernel space. Do you have any real example ?
Of course, if you play with different packing options on two side of an ABI, whatever it is, you are in trouble. The problem here is not sizeof.
If your question is : does packing options affect binary interface, the answer is yes.
If your question is, how can I solve a packing mismatch, please provide more information

Resources