cudaMemcpyAsync only one member of a struct from device to host

I have a struct with multiple members, and I want the GPUs to operate on only some of those members. To keep the amount of data transferred as small as possible, I would like to copy back only the members that have been modified. Can CUDA do that?
struct nodeInfo;
typedef struct nodeInfo
{
int x;
int y;
}nodeProp;
int main(int argc, char* argv[]){
int ngpus;
CHECK(cudaGetDeviceCount(&ngpus));
cudaStream_t stream[ngpus];
nodeProp *Nodes;
nodeProp *gpuNodes[ngpus];
int rankSize = 10;
int deviceSize = rankSize/ngpus;
CHECK(cudaMallocHost((void**)&Nodes,rankSize*sizeof(nodeProp)));
for(int i = 0; i < ngpus; i++)
{
cudaSetDevice(i);
cudaStreamCreate(&stream[i]);
CHECK(cudaMalloc((void**)&gpuNodes[i],deviceSize*sizeof(nodeProp)));
CHECK(cudaMemcpyAsync(gpuNodes[i],Nodes+i*deviceSize,deviceSize*sizeof(nodeProp),cudaMemcpyHostToDevice,stream[i]));
}
for(int i = 0; i < ngpus; i++)
{
cudaSetDevice(i);
kernel_x_Operation<<<grid_size,block_size,0,stream[i]>>>(gpuNodes[i]);//Some operation on gpuNodes.x
//How to write the memcpy function? Can I just copy one member of the struct back?
CHECK((void*)cudaMemcpyAsync((Nodes+i*deviceSize)->x, gpuNodes[i]->x), sizeof(int)*deviceSize,cudaMemcpyDeviceToHost,stream[i]));
cudaDeviceSynchronize();
}
}

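No, not with a single cudaMemcpyAsync call. It copies one contiguous range of bytes, and in an array of structs the x members are interleaved with the y members. You can achieve something similar, though, by laying your data out as a Struct of Arrays instead of an Array of Structs.
Have a look at Structure of Arrays vs Array of Structures in CUDA to see how this might even improve performance.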
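A minimal host-only C++ sketch of the Struct of Arrays layout, assuming the two int members from the question. NodesSoA and copy_back_x are hypothetical names, and std::memcpy stands in for the cudaMemcpyAsync call so that the example runs without a GPU. Because all x values are contiguous, one copy of n * sizeof(int) bytes returns only that member:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Struct of Arrays: each member lives in its own contiguous array.
// (Hypothetical name; the question's array-of-structs type is nodeProp.)
struct NodesSoA {
    std::vector<int> x;
    std::vector<int> y;
    explicit NodesSoA(std::size_t n) : x(n, 0), y(n, 0) {}
};

// Copy only the x member from src to dst and return the bytes transferred.
// std::memcpy is a stand-in (assumption of this sketch) for something like
//   cudaMemcpyAsync(dst.x.data(), d_x, bytes, cudaMemcpyDeviceToHost, stream)
// where d_x would be a separate device allocation holding the x values.
std::size_t copy_back_x(NodesSoA &dst, const NodesSoA &src) {
    if (dst.x.size() < src.x.size()) {
        dst.x.resize(src.x.size());
    }
    const std::size_t bytes = src.x.size() * sizeof(int);
    if (bytes != 0) {
        std::memcpy(dst.x.data(), src.x.data(), bytes);
    }
    return bytes;
}
```

The y array is never touched by the copy, so the transfer is half the size of copying the whole array of structs in the two-member case.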

Related

c: using strcpy() on an element inside a data structure

I am currently taking CS50 and studying data structures. I came across a problem; please see if you can help:
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
struct node{
char *name;
int age;
};
typedef struct node node;
int main(){
node *p=malloc(sizeof(node));
if (p==NULL){
return 1;
}
p->name="hussein";
p->age=32;
printf("staff1: %s, %d\n", p->name, p->age); //why doesn't this line of code work? program crashes here
strcpy(p->name, "hasan");
printf("staff1: %s, %d\n", p->name, p->age);
free(p);
return 0;
}
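Use p->name = "hasan"; instead of strcpy(p->name, "hasan");.
The name member of struct node is only a pointer; it has no storage of its own for a char array.
After p->name = "hussein"; it points at a string literal, which must not be modified, so strcpy() into it is undefined behavior and typically crashes. To use strcpy(), first point name at writable memory that is large enough, for example memory obtained from malloc().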
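If the name really needs to be copied rather than re-pointed, the storage has to exist first. A minimal sketch, written as C++ with the C library calls from the question; set_name is a hypothetical helper, and it assumes name is either a null pointer or memory previously returned by malloc():

```cpp
#include <cstdlib>
#include <cstring>

struct node {
    char *name;
    int age;
};

// Replace n->name with a heap-allocated copy of new_name.
// Assumes n->name is either a null pointer or came from malloc().
// Returns true on success, false if allocation fails (name is left unchanged).
bool set_name(node *n, const char *new_name) {
    char *copy = static_cast<char *>(std::malloc(std::strlen(new_name) + 1));
    if (copy == nullptr) {
        return false;
    }
    std::strcpy(copy, new_name);  // safe: copy has room for the terminator
    std::free(n->name);           // freeing a null pointer is a no-op
    n->name = copy;
    return true;
}
```

With this approach the caller must free(p->name) before free(p), since the struct now owns the string.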

Try to compare 2 methods to implement bounded blocking queue

The bounded blocking queue is a well-known problem. There are two main ways to implement it, and I am trying to understand which one is better:
Method 1: use counting semaphore
void *producer(void *arg) {
int i;
for (i = 0; i < loops; i++) {
sem_wait(&empty);
sem_wait(&mutex);
put(i);
sem_post(&mutex);
sem_post(&full);
}
}
void *consumer(void *arg) {
int i;
for (i = 0; i < loops; i++) {
sem_wait(&full);
sem_wait(&mutex);
int tmp = get();
sem_post(&mutex);
sem_post(&empty);
printf("%d\n", tmp);
}
}
Method 2: classic monitor pattern
class BoundedBuffer {
private:
int buffer[MAX];
int fill, use;
int fullEntries;
pthread_mutex_t monitor; // monitor lock
pthread_cond_t empty;
pthread_cond_t full;
public:
BoundedBuffer() {
use = fill = fullEntries = 0;
}
void produce(int element) {
pthread_mutex_lock(&monitor);
while (fullEntries == MAX)
pthread_cond_wait(&empty, &monitor);
//do something
pthread_cond_signal(&full);
pthread_mutex_unlock(&monitor);
}
int consume() {
pthread_mutex_lock(&monitor);
while (fullEntries == 0)
pthread_cond_wait(&full, &monitor);
//do something
pthread_cond_signal(&empty);
pthread_mutex_unlock(&monitor);
return tmp;
}
}
I understand that the second method can also solve many other problems. But how do these two methods compare? It looks like both can do the job.
Is there a link to a detailed comparison?
Appreciate your help.
Thanks.
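The main difference between the two methods is that the first uses POSIX semaphores (sem_wait() and sem_post() from <semaphore.h>, which are not pthread_ functions), while the second uses a pthread mutex with condition variables. Both are usable in a multithreaded environment: POSIX lists sem_wait() and sem_post() alongside pthread_mutex_lock() and pthread_mutex_unlock() among the functions that synchronize memory, so on a conforming implementation data written before a sem_post() is visible to the thread whose sem_wait() it releases.
The practical difference is flexibility. A semaphore can only count, whereas the monitor version waits on an arbitrary condition checked under the mutex, which makes it the more natural fit for a multithreaded message queue.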
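For comparison, here is a minimal sketch of the monitor pattern (Method 2) with the "//do something" placeholders filled in. It uses std::mutex and std::condition_variable from the C++ standard library instead of the pthread calls in the question, and it makes the capacity (MAX in the question) a template parameter; both choices are assumptions of this sketch, not part of the original code.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Monitor-style bounded buffer: one lock, two condition variables.
template <std::size_t Capacity>
class BoundedBuffer {
    static_assert(Capacity > 0, "capacity must be positive");

public:
    void produce(int element) {
        std::unique_lock<std::mutex> lock(monitor_);
        // wait() re-checks the predicate after every wakeup, so spurious
        // wakeups are handled like the while loop in the question.
        not_full_.wait(lock, [this] { return count_ < Capacity; });
        buffer_[fill_] = element;
        fill_ = (fill_ + 1) % Capacity;
        ++count_;
        not_empty_.notify_one();
    }

    int consume() {
        std::unique_lock<std::mutex> lock(monitor_);
        not_empty_.wait(lock, [this] { return count_ > 0; });
        int value = buffer_[use_];
        use_ = (use_ + 1) % Capacity;
        --count_;
        not_full_.notify_one();
        return value;
    }

private:
    int buffer_[Capacity] = {};
    std::size_t fill_ = 0;   // next slot to write
    std::size_t use_ = 0;    // next slot to read
    std::size_t count_ = 0;  // number of filled slots
    std::mutex monitor_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};
```

The wait conditions are ordinary expressions over the protected state, so adding a further condition such as a shutdown flag only changes the predicates, not the overall structure.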

Copy struct with function pointer to device

I have a struct containing the parameters of a linear function, as well as the function itself. What I want to do is copy this struct to the device and then evaluate the linear function. The following example doesn't make sense but it is sufficient to describe the difficulties I have:
struct model
{
double* params;
double (*func)(double*, double);
};
I don't know how to copy this struct to the device.
Here are my functions:
Init function
// init function for struct model
__host__ void model_init(model* m, double* params, double(*func)(double*,double))
{
if(m)
{
m->params = params;
m->func = func;
}
}
Evaluation function
__device__ double model_evaluate(model* m, double x)
{
if(m)
{
return m->func(m->params, x);
}
return 0.0;
}
The actual function
__host__ __device__ double linear_function(double* params, double x)
{
return params[0] + params[1] * x;
}
Function called inside kernel
__device__ double compute(model *d_linear_model)
{
return model_evaluate(d_linear_model,1.0);
}
The kernel itself
__global__ void kernel(double *array, model *d_linear_model, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
{
array[idx] = compute(d_linear_model);
}
}
I know how to copy an array from host to device but I don't know how to do this for this concrete struct which contains a function.
The kernel call in main then looks like this:
int block_size = 4;
int n_blocks = N_array/block_size + (N_array % block_size == 0 ? 0:1);
kernel<<<n_blocks, block_size>>>(device_array, d_linear_model, N_array);
You've outlined two items that I consider to be somewhat more difficult than beginner-level CUDA programming:
use of device function pointers
a "deep copy" operation (on the embedded params pointer in your model structure)
Both of these topics have been covered in other questions. For example this question/answer discusses deep copy operations - when a data structure has embedded pointers to other data. And this question/answer links to a variety of resources on device function pointer usage.
But I'll go ahead and offer a possible solution for your posted case. Most of your code is usable as-is (at least for demonstration purposes). As mentioned already, your model structure will present two challenges:
struct model
{
double* params; // requires a "deep copy" operation
double (*func)(double*, double); // requires special handling for device function pointers
};
As a result, although most of your code is usable as-is, your "init" function is not. That might work for a host realization, but not for a device realization.
The deep copy operation requires us to copy the overall structure, plus separately copy the data pointed to by the embedded pointer, plus separately copy or "fixup" the embedded pointer itself.
The usage of a device function pointer is restricted by the fact that we cannot grab the actual device function pointer in host code - that is illegal in CUDA. So one possible solution is to use a __device__ construct to "capture" the device function pointer, then do a cudaMemcpyFromSymbol operation in host code, to retrieve the numerical value of the device function pointer, which can then be moved about in ordinary fashion.
Here's a worked example building on what you have shown, demonstrating the two concepts above. I have not created a "device init" function, but all the code necessary to do that is in the main function. Once you've grasped the concepts, you can take whatever code you wish out of the main function below and craft it into your "device init" function, if you wish to create one:
$ cat t968.cu
#include <iostream>
#define NUM_PARAMS 2
#define ARR_SIZE 1
#define nTPB 256
struct model
{
double* params;
double (*func)(double*, double);
};
// init function for struct model -- not using this for device operations
__host__ void model_init(model* m, double* params, double(*func)(double*,double))
{
if(m)
{
m->params = params;
m->func = func;
}
}
__device__ double model_evaluate(model* m, double x)
{
if(m)
{
return m->func(m->params, x);
}
return 0.0;
}
__host__ __device__ double linear_function(double* params, double x)
{
return params[0] + params[1] * x;
}
__device__ double compute(model *d_linear_model)
{
return model_evaluate(d_linear_model,1.0);
}
__global__ void kernel(double *array, model *d_linear_model, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N)
{
array[idx] = compute(d_linear_model);
}
}
__device__ double (*linear_function_ptr)(double*, double) = linear_function;
int main(){
// grab function pointer from device code
double (*my_fp)(double*, double);
cudaMemcpyFromSymbol(&my_fp, linear_function_ptr, sizeof(void *));
// setup model
model my_model;
my_model.params = new double[NUM_PARAMS];
my_model.params[0] = 1.0;
my_model.params[1] = 2.0;
my_model.func = my_fp;
// setup for device copy of model
model *d_model;
cudaMalloc(&d_model, sizeof(model));
// setup "deep copy" for params
double *d_params;
cudaMalloc(&d_params, NUM_PARAMS*sizeof(double));
cudaMemcpy(d_params, my_model.params, NUM_PARAMS*sizeof(double), cudaMemcpyHostToDevice);
// copy model to device
cudaMemcpy(d_model, &my_model, sizeof(model), cudaMemcpyHostToDevice);
// fixup device params pointer in device model
cudaMemcpy(&(d_model->params), &d_params, sizeof(double *), cudaMemcpyHostToDevice);
// run test
double *d_array, *h_array;
cudaMalloc(&d_array, ARR_SIZE*sizeof(double));
h_array = new double[ARR_SIZE];
for (int i = 0; i < ARR_SIZE; i++) h_array[i] = i;
cudaMemcpy(d_array, h_array, ARR_SIZE*sizeof(double), cudaMemcpyHostToDevice);
kernel<<<(ARR_SIZE+nTPB-1)/nTPB,nTPB>>>(d_array, d_model, ARR_SIZE);
cudaMemcpy(h_array, d_array, ARR_SIZE*sizeof(double), cudaMemcpyDeviceToHost);
std::cout << "Results: " << std::endl;
for (int i = 0; i < ARR_SIZE; i++) std::cout << h_array[i] << " ";
std::cout << std::endl;
return 0;
}
$ nvcc -o t968 t968.cu
$ cuda-memcheck ./t968
========= CUDA-MEMCHECK
Results:
3
========= ERROR SUMMARY: 0 errors
$
For brevity of presentation, I've dispensed with proper CUDA error checking (instead I have run the code with cuda-memcheck to demonstrate that it is without runtime error), but I would recommend proper error checking if you're having any trouble with your code.

CUDA copy linked lists from device to host

I am trying to populate a number of linked lists on the device and then return those lists to the host.
From my understanding I need to allocate memory for my struct Element, but I don't know how to go about it since I will have many linked lists, each with an unknown number of elements. I've tried a couple of different things but it still didn't work. So I'm back to the starting point. Here is my code:
//NODE CLASS
class Node{
public:
int x,y;
Node *parent;
__device__ __host__ Node(){}
__device__ __host__ Node(int cX, int cY){x = cX; y = cY;}
__device__ __host__ int get_row() { return x; }
__device__ __host__ int get_col() { return y; }
};
//LINKED LIST
class LinkedList{
public:
__device__ __host__ struct Element{
Node n1;
Element *next;
};
__device__ __host__ LinkedList(){
head = NULL;
}
__device__ __host__ void addNode(Node n){
Element *el = new Element();
el->n1 = n;
el->next = head;
head = el;
}
__device__ __host__ Node popFirstNode(){
Element *cur = head;
Node n;
if(cur != NULL){
n = cur -> n1;
head = head -> next;
}
delete cur;
return n;
}
__device__ __host__ bool isEmpty(){
Element *cur = head;
if(cur == NULL){
return true;
}
return false;
}
Element *head;
};
//LISTS
__global__ void listsKernel(LinkedList* d_Results, int numLists){
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Node n(1,1);
if(idx < numLists){
d_Results[idx].addNode(n);
d_Results[idx].addNode(n);
d_Results[idx].addNode(n);
d_Results[idx].addNode(n);
}
}
int main(){
int numLists = 10;
size_t size = numLists * sizeof(LinkedList);
LinkedList curList;
LinkedList* h_Results = (LinkedList*)malloc(size);
LinkedList* d_Results;
cudaMalloc((void**)&d_Results, size);
listsKernel<<<256,256>>>(d_Results, numLists);
cudaMemcpy(h_Results, d_Results, sizeof(LinkedList)*numLists, cudaMemcpyDeviceToHost);
for(int i = 0; i < numLists; i++){
curList = h_Results[i];
while(curList.isEmpty() == false){
Node n = curList.popFirstNode();
std::cout << "x: " << n.get_row() << " y: " << n.get_col();
}
}
}
As you can see, I'm trying to populate 10 linked lists on the device and then return them to the host, but the code above results in an unhandled exception: "Access violation reading location". I assume it is not copying the pointers from the device.
Any help would be great.
Just eyeballing the code, it seems you have a fundamental misconception: there is host memory which cannot be accessed from the device, and device memory which cannot be accessed from the host. So when you create linked list nodes in device memory and copy the pointers back to the host, the host cannot dereference those pointers, because they are pointing to device memory.
If you truly want to pass linked lists back and forth between host and device, your best bet is probably to copy the entries into an array, do the memcpy, then copy the array back into a linked list. Other things can be done too, depending on just what your use case is.
(It is possible to allocate a region of memory that is accessible from both the host and the device, but it comes with some awkwardness, and I have no experience using it.)
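The array round trip can be sketched in host-only C++ as follows. Node and Element are simplified versions of the question's types, and flatten, rebuild, and free_list are hypothetical helper names. In the real program the flattening would happen on the device and the flat array is what a single cudaMemcpy would transfer; the sketch only shows the pointer-free intermediate form.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int x, y;
};

struct Element {
    Node n1;
    Element *next;
};

// Copy the list entries, head first, into a contiguous array that
// contains no pointers and is therefore safe to copy between memories.
std::vector<Node> flatten(const Element *head) {
    std::vector<Node> out;
    for (const Element *cur = head; cur != nullptr; cur = cur->next) {
        out.push_back(cur->n1);
    }
    return out;
}

// Rebuild a list from the array, preserving the original order.
// The caller owns the returned nodes and releases them with free_list().
Element *rebuild(const std::vector<Node> &flat) {
    Element *head = nullptr;
    for (std::size_t i = flat.size(); i-- > 0;) {
        head = new Element{flat[i], head};
    }
    return head;
}

void free_list(Element *head) {
    while (head != nullptr) {
        Element *next = head->next;
        delete head;
        head = next;
    }
}
```

With several lists of unknown length, one option is to pack all entries into a single array and keep a second array with the element count of each list.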

Memory allocation of struct in C++

My struct is as follows:
typedef struct KeypointSt {
float row, col;
float scale, ori;
unsigned char *descrip; /* Vector of descriptor values */
struct KeypointSt *next;
} *Keypoint;
The following is part of some C code. How can I translate it to C++, taking heap allocation and deallocation into account?
Keypoint k, keys = NULL;
for (i = 0; i < num; i++) {
/* Allocate memory for the keypoint. */
k = (Keypoint) malloc(sizeof(struct KeypointSt));
k->next = keys;
keys = k;
k->descrip = malloc(len);
for (j = 0; j < len; j++) {
k->descrip[j] = (unsigned char) val;
}
}
One possible way of converting to C++ is:
#include <cstring> // memset()
typedef struct KeypointSt
{
float row, col;
float scale, ori;
size_t len;
unsigned char *descrip; /* Vector of descriptor values */
KeypointSt *next;
KeypointSt(size_t p_len, unsigned char p_val) : row(0.0), col(0.0), scale(0.0),
ori(0.0), len(p_len),
descrip(new unsigned char[len]), next(0)
{ memset(descrip, p_val, len); } // memset(ptr, value, count)
~KeypointSt() { delete[] descrip; } // memory from new[] must be released with delete[]
} *Keypoint;
extern KeypointSt *init_keypoints(size_t num, size_t len, unsigned char val);
extern void free_keypoints(KeypointSt *list);
KeypointSt *init_keypoints(size_t num, size_t len, unsigned char val)
{
KeypointSt *keys = NULL;
for (size_t i = 0; i < num; i++)
{
/* Allocate memory for the keypoint. */
KeypointSt *k = new KeypointSt(len, val);
k->next = keys;
keys = k;
}
return keys;
}
void free_keypoints(KeypointSt *list)
{
while (list != 0)
{
KeypointSt *next = list->next;
delete list;
list = next;
}
}
int main(void)
{
KeypointSt *keys = init_keypoints(4, 5, 6);
free_keypoints(keys);
return 0;
}
The only reason I've kept the typedef in place is because you have existing code; the C++ code would be better using KeypointSt * everywhere — or renaming the structure tag to Keypoint and using Keypoint * in place of your original Keypoint. I don't like non-opaque types where the typedef conceals a pointer. If I see a declaration XYZ xyz;, and it is a structure or class type, I expect to use xyz.pqr and not xyz->pqr.
We can debate code layout of the constructor code, the absence of a default constructor (no arrays), and the absence of a copy constructor and an assignment operator (both needed because of the allocation for descrip). The code of init_keypoints() is not exception safe; a memory allocation failure will leak memory. Fixing that is left as an exercise (it isn't very hard, I think, but I don't claim exception-handling expertise). I've not attempted to consider any extra requirements imposed by C++11. Simply translating from C to C++ is 'easy' until you look at the extra demands that C++ makes — demands that make your life easier in the long run, but at a short-term cost in pain.
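One way to sidestep the missing copy operations and the exception-safety problem is to let standard containers own the memory. The following is a minimal sketch that assumes C++11 is available, which the code above does not require. Here Keypoint names the struct itself rather than a pointer typedef, and std::forward_list replaces the hand-written next pointer; both are choices made for this sketch.

```cpp
#include <cstddef>
#include <forward_list>
#include <vector>

struct Keypoint {
    float row = 0.0f, col = 0.0f;
    float scale = 0.0f, ori = 0.0f;
    std::vector<unsigned char> descrip;  // owns the descriptor values

    // Create a descriptor of len bytes, each set to val.
    Keypoint(std::size_t len, unsigned char val) : descrip(len, val) {}
};

// Build num keypoints, prepending each one as the original loop does.
// If an allocation throws, the containers release what was already built.
std::forward_list<Keypoint> init_keypoints(std::size_t num, std::size_t len,
                                           unsigned char val) {
    std::forward_list<Keypoint> keys;
    for (std::size_t i = 0; i < num; ++i) {
        keys.emplace_front(len, val);
    }
    return keys;
}
```

No destructor, copy constructor, assignment operator, or free_keypoints() function is needed, because the vector and the list clean up after themselves.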

Resources