Memory allocation of struct in C++

My struct is as follows:
typedef struct KeypointSt {
float row, col;
float scale, ori;
unsigned char *descrip; /* Vector of descriptor values */
struct KeypointSt *next;
} *Keypoint;
The following is part of some C code. How can I translate it to C++, taking heap allocation and deallocation into account?
Keypoint k, keys = NULL;
for (i = 0; i < num; i++) {
    /* Allocate memory for the keypoint. */
    k = (Keypoint) malloc(sizeof(struct KeypointSt));
    k->next = keys;
    keys = k;
    k->descrip = malloc(len);
    for (j = 0; j < len; j++) {
        k->descrip[j] = (unsigned char) val;
    }
}

One possible way of converting to C++ is:
#include <cstddef> // size_t, NULL
#include <cstring> // memset()

typedef struct KeypointSt
{
    float row, col;
    float scale, ori;
    size_t len;
    unsigned char *descrip; /* Vector of descriptor values */
    KeypointSt *next;
    KeypointSt(int p_len, int p_val) : row(0.0), col(0.0), scale(0.0),
                                       ori(0.0), len(p_len),
                                       descrip(new unsigned char[len]), next(0)
    { memset(descrip, p_val, len); }     // memset(dest, value, count)
    ~KeypointSt() { delete[] descrip; }  // new[] must be paired with delete[]
} *Keypoint;

extern KeypointSt *init_keypoints(size_t num, size_t len, unsigned char val);
extern void free_keypoints(KeypointSt *list);

KeypointSt *init_keypoints(size_t num, size_t len, unsigned char val)
{
    KeypointSt *keys = NULL;
    for (size_t i = 0; i < num; i++)
    {
        /* Allocate memory for the keypoint. */
        KeypointSt *k = new KeypointSt(len, val);
        k->next = keys;
        keys = k;
    }
    return keys;
}

void free_keypoints(KeypointSt *list)
{
    while (list != 0)
    {
        KeypointSt *next = list->next;
        delete list;
        list = next;
    }
}

int main(void)
{
    KeypointSt *keys = init_keypoints(4, 5, 6);
    free_keypoints(keys);
    return 0;
}
The only reason I've kept the typedef in place is because you have existing code; the C++ code would be better using KeypointSt * everywhere — or renaming the structure tag to Keypoint and using Keypoint * in place of your original Keypoint. I don't like non-opaque types where the typedef conceals a pointer. If I see a declaration XYZ xyz;, and it is a structure or class type, I expect to use xyz.pqr and not xyz->pqr.
We can debate code layout of the constructor code, the absence of a default constructor (no arrays), and the absence of a copy constructor and an assignment operator (both needed because of the allocation for descrip). The code of init_keypoints() is not exception safe; a memory allocation failure will leak memory. Fixing that is left as an exercise (it isn't very hard, I think, but I don't claim exception-handling expertise). I've not attempted to consider any extra requirements imposed by C++11. Simply translating from C to C++ is 'easy' until you look at the extra demands that C++ makes — demands that make your life easier in the long run, but at a short-term cost in pain.
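As a rough sketch of the two points raised above (the missing copy constructor and assignment operator, and the exception safety of init_keypoints()), something along these lines might work. It is only one possible approach: the copy-and-swap assignment, the choice to leave next untouched on assignment, and the try/catch cleanup are my assumptions, not part of the original code.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

struct KeypointSt {
    float row, col, scale, ori;
    std::size_t len;
    unsigned char *descrip;  // owned array of len bytes
    KeypointSt *next;

    KeypointSt(std::size_t p_len, unsigned char p_val)
        : row(0), col(0), scale(0), ori(0), len(p_len),
          descrip(new unsigned char[p_len]), next(nullptr) {
        std::memset(descrip, p_val, len);
    }

    // Copy constructor: deep-copies the descriptor; the copy is not linked into any list.
    KeypointSt(const KeypointSt &o)
        : row(o.row), col(o.col), scale(o.scale), ori(o.ori), len(o.len),
          descrip(new unsigned char[o.len]), next(nullptr) {
        std::memcpy(descrip, o.descrip, len);
    }

    // Copy-and-swap assignment: the by-value parameter is the copy, and its
    // destructor releases the old descriptor. `next` is deliberately kept.
    KeypointSt &operator=(KeypointSt o) {
        std::swap(row, o.row);
        std::swap(col, o.col);
        std::swap(scale, o.scale);
        std::swap(ori, o.ori);
        std::swap(len, o.len);
        std::swap(descrip, o.descrip);
        return *this;
    }

    ~KeypointSt() { delete[] descrip; }
};

void free_keypoints(KeypointSt *list) {
    while (list != nullptr) {
        KeypointSt *next = list->next;
        delete list;
        list = next;
    }
}

// If any allocation throws, the nodes built so far are released before rethrowing.
KeypointSt *init_keypoints(std::size_t num, std::size_t len, unsigned char val) {
    KeypointSt *keys = nullptr;
    try {
        for (std::size_t i = 0; i < num; i++) {
            KeypointSt *k = new KeypointSt(len, val);
            k->next = keys;
            keys = k;
        }
    } catch (...) {
        free_keypoints(keys);
        throw;
    }
    return keys;
}
```

In new code, holding the descriptor in a std::vector<unsigned char> would remove the need for the hand-written copy operations altogether.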


Creating dynamically sized MPI file views

I would like to write out a binary file using collective MPI I/O. My plan is to create an MPI derived type analogous to
struct soln_dynamic_t
{
    int int_data[2];
    double *u; /* Length constant for all instances of this struct */
};
Each processor then creates a view based on the derived type, and writes into the view.
I have this all working for the case in which *u is replaced with u[10] (see complete code below), but ultimately, I'd like to have a dynamic length array for u. (In case it matters, the length will be fixed for all instances of soln_dynamic_t for any run, but not known at compile time.)
What is the best way to handle this?
I have read several posts on why I can't use soln_dynamic_t directly as an MPI structure. The problem is that processors are not guaranteed to have the same offset between u[0] and int_data[0]. (Is that right?)
On the other hand, the structure
struct soln_static_t
{
    int int_data[2];
    double u[10]; /* fixed at compile time */
};
works because the offsets are guaranteed to be the same across processors.
I've considered several approaches:
Create the view based on manually defined offsets, etc., rather than using a derived type.
Base the MPI structure on another MPI type, i.e. a contiguous type for *u (is that allowed?).
I am guessing there must be a standard way to do this. Any suggestions would be very helpful.
Several other posts on this issue have been helpful, although they mostly deal with communication and not file I/O.
Here is the complete code:
#include <mpi.h>

typedef struct
{
    int int_data[2];
    double u[10]; /* Make this a dynamic length (but fixed) */
} soln_static_t;

void build_soln_type(int n, int* int_data, double *u, MPI_Datatype *soln_t)
{
    int block_lengths[2] = {2,n};
    MPI_Datatype typelist[2] = {MPI_INT, MPI_DOUBLE};

    /* Displacement of u relative to int_data */
    MPI_Aint disp[2], start_address, address;
    MPI_Get_address(int_data, &start_address);
    MPI_Get_address(u, &address);
    disp[0] = 0;
    disp[1] = address-start_address;

    MPI_Datatype tmp_type;
    MPI_Type_create_struct(2, block_lengths, disp, typelist, &tmp_type);

    MPI_Aint extent;
    extent = block_lengths[0]*sizeof(int) + block_lengths[1]*sizeof(double);
    MPI_Type_create_resized(tmp_type, 0, extent, soln_t);
    MPI_Type_commit(soln_t);
}

int main(int argc, char** argv)
{
    MPI_File file;
    int globalsize, localsize, starts, order;
    MPI_Datatype localarray, soln_t;
    int rank, nprocs, nsize = 10; /* must match size in struct above */

    /* --- Initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* --- Set up data to write out */
    soln_static_t data;
    data.int_data[0] = nsize;
    data.int_data[1] = rank;
    data.u[0] = 3.14159; /* To check that data is written as expected */
    build_soln_type(nsize, data.int_data, data.u, &soln_t);

    MPI_File_open(MPI_COMM_WORLD, "bin.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &file);

    /* --- Create file view for this processor */
    globalsize = nprocs;
    localsize = 1;
    starts = rank;
    order = MPI_ORDER_C;
    MPI_Type_create_subarray(1, &globalsize, &localsize, &starts, order,
                             soln_t, &localarray);
    MPI_Type_commit(&localarray);
    MPI_File_set_view(file, 0, soln_t, localarray,
                      "native", MPI_INFO_NULL);

    /* --- Write data into view */
    MPI_File_write_all(file, data.int_data, 1, soln_t, MPI_STATUS_IGNORE);

    /* --- Clean up */
    MPI_File_close(&file);
    MPI_Type_free(&localarray);
    MPI_Type_free(&soln_t);
    MPI_Finalize();
    return 0;
}
Since the size of the u array of the soln_dynamic_t type is known at runtime and will not change after that, I'd suggest another approach.
Basically, you store all the data contiguously in memory:
typedef struct
{
    int int_data[2];
    double u[]; /* Make this a dynamic length (but fixed) */
} soln_dynamic_t;
Then you have to allocate this struct manually:
soln_dynamic_t * alloc_soln(int nsize, int count) {
    /* needs <stdlib.h> for calloc and <stddef.h> for offsetof */
    return (soln_dynamic_t *)calloc(count, offsetof(soln_dynamic_t, u)+nsize*sizeof(double));
}
Note that you cannot index an array of soln_dynamic_t directly, because the real element size (including u) is not known at compile time. Instead, you have to calculate the pointers manually.
soln_dynamic_t *p = alloc_soln(10, 2);
p[0].int_data[0] = 1; // OK
p[0].u[0] = 2; // OK
p[1].int_data[0] = 3; // KO! sizeof(soln_dynamic_t) does not include u, so p[1] points to the wrong place
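To make the "calculate the pointers manually" step concrete, here is a small sketch of the stride arithmetic. Standard C++ has no flexible array member, so this version uses a separate header struct followed by the doubles in the same allocation; in C the same stride calculation applies to the struct with double u[]. The names soln_header, soln_u_offset, soln_stride, soln_at and soln_u are hypothetical helpers introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdlib>

// Fixed-size part of each record; the doubles follow it in the same allocation.
struct soln_header {
    int int_data[2];
};

// Byte offset of the first double inside a record, rounded up so it is
// suitably aligned for both the header and double.
inline std::size_t soln_u_offset() {
    const std::size_t a = alignof(double) > alignof(soln_header)
                              ? alignof(double) : alignof(soln_header);
    return (sizeof(soln_header) + a - 1) / a * a;
}

// Distance in bytes from one record to the next for a run-time length nsize.
inline std::size_t soln_stride(std::size_t nsize) {
    return soln_u_offset() + nsize * sizeof(double);
}

// Zero-initialised storage for count records of nsize doubles each.
inline void *alloc_soln(std::size_t nsize, std::size_t count) {
    return std::calloc(count, soln_stride(nsize));
}

// Record i, located by byte arithmetic instead of p[i].
inline soln_header *soln_at(void *base, std::size_t nsize, std::size_t i) {
    return reinterpret_cast<soln_header *>(
        static_cast<unsigned char *>(base) + i * soln_stride(nsize));
}

// The double array that belongs to a record.
inline double *soln_u(soln_header *rec) {
    return reinterpret_cast<double *>(
        reinterpret_cast<unsigned char *>(rec) + soln_u_offset());
}
```

With this, soln_at(buf, nsize, 1)->int_data[0] = 3 writes to the second record without touching the first record's doubles.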
Here is the full rewritten version of your program:
#include <mpi.h>
#include <stddef.h> /* offsetof */
#include <stdlib.h> /* calloc, free */

typedef struct
{
    int int_data[2];
    double u[]; /* Make this a dynamic length (but fixed) */
} soln_dynamic_t;

void build_soln_type(int n, MPI_Datatype *soln_t)
{
    int block_lengths[2] = {2,n};
    MPI_Datatype typelist[2] = {MPI_INT, MPI_DOUBLE};

    MPI_Aint disp[2];
    disp[0] = offsetof(soln_dynamic_t, int_data);
    disp[1] = offsetof(soln_dynamic_t, u);

    MPI_Datatype tmp_type;
    MPI_Type_create_struct(2, block_lengths, disp, typelist, &tmp_type);

    MPI_Aint extent;
    extent = offsetof(soln_dynamic_t, u) + block_lengths[1]*sizeof(double);
    MPI_Type_create_resized(tmp_type, 0, extent, soln_t);
    MPI_Type_commit(soln_t);
}

soln_dynamic_t * alloc_soln(int nsize, int count) {
    return (soln_dynamic_t *)calloc(count, offsetof(soln_dynamic_t, u) + nsize*sizeof(double));
}

int main(int argc, char** argv)
{
    MPI_File file;
    int globalsize, localsize, starts, order;
    MPI_Datatype localarray, soln_t;
    int rank, nprocs, nsize = 10; /* length of u, chosen at run time */

    /* --- Initialize MPI */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* --- Set up data to write out */
    soln_dynamic_t *data = alloc_soln(nsize,1);
    data->int_data[0] = nsize;
    data->int_data[1] = rank;
    data->u[0] = 3.14159; /* To check that data is written as expected */
    build_soln_type(nsize, &soln_t);

    MPI_File_open(MPI_COMM_WORLD, "bin2.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &file);

    /* --- Create file view for this processor */
    globalsize = nprocs;
    localsize = 1;
    starts = rank;
    order = MPI_ORDER_C;
    MPI_Type_create_subarray(1, &globalsize, &localsize, &starts, order,
                             soln_t, &localarray);
    MPI_Type_commit(&localarray);
    MPI_File_set_view(file, 0, soln_t, localarray,
                      "native", MPI_INFO_NULL);

    /* --- Write data into view */
    MPI_File_write_all(file, data, 1, soln_t, MPI_STATUS_IGNORE);

    /* --- Clean up */
    MPI_File_close(&file);
    MPI_Type_free(&localarray);
    MPI_Type_free(&soln_t);
    free(data);
    MPI_Finalize();
    return 0;
}

custom memory allocator - segfault

My friend and I are trying to develop a custom memory allocator on Ubuntu 16.04 (Linux).
We got stuck on an error. This is our first attempt at writing something like this, so we are not the best debuggers. The error is: Segmentation fault (core dumped)
Here is the code.
Can anybody help us understand what's wrong?
Thank you!
#include <unistd.h>
#include <string.h>
#include <pthread.h>
#include <stdio.h>
struct header_t {
size_t size;
unsigned is_free;
struct header_t *next; };
struct header_t *head = NULL, *tail = NULL;
pthread_mutex_t global_malloc_lock;
struct header_t *get_free_block(size_t size)
{
    struct header_t *curr = head;
    while(curr) {
        /* see if there's a free block that can accommodate requested size */
        if (curr->is_free && curr->size >= size)
            return curr;
        curr = curr->next;
    }
    return NULL;
}
void free(void *block)
{
    struct header_t *header, *tmp;
    /* program break is the end of the
       process's data segment */
    void *programbreak;
    if (!block)
        return;
    pthread_mutex_lock(&global_malloc_lock);
    header = (struct header_t*)block - 1;
    /* sbrk(0) gives the current program break address */
    programbreak = sbrk(0);
    /* Check if the block to be freed is the last one in the
       linked list. If it is, then we could shrink the size of the
       heap and release memory to OS. Else, we will keep the block
       but mark it as free. */
    if ((char*)block + header->size == programbreak) {
        if (head == tail) {
            head = tail = NULL;
        } else {
            tmp = head;
            while (tmp) {
                if(tmp->next == tail) {
                    tmp->next = NULL;
                    tail = tmp;
                }
                tmp = tmp->next;
            }
        }
        /* sbrk() with a negative argument decrements the program break.
           So memory is released by the program to OS. */
        sbrk(0 - header->size - sizeof(struct header_t));
        /* Note: This lock does not really assure thread
           safety, because sbrk() itself is not really
           thread safe. Suppose there occurs a foreign sbrk(N)
           after we find the program break and before we decrement
           it, then we end up releasing the memory obtained by
           the foreign sbrk(). */
        pthread_mutex_unlock(&global_malloc_lock);
        return;
    }
    header->is_free = 1;
    pthread_mutex_unlock(&global_malloc_lock);
}
void *malloc(size_t size)
{
    size_t total_size;
    void *block;
    struct header_t *header;
    if (!size)
        return NULL;
    pthread_mutex_lock(&global_malloc_lock);
    header = get_free_block(size);
    if (header) {
        /* Woah, found a free block to accommodate requested memory. */
        header->is_free = 0;
        pthread_mutex_unlock(&global_malloc_lock);
        return (void*)(header + 1);
    }
    /* We need to get memory to fit in the requested block and header
       from OS. */
    total_size = sizeof(struct header_t) + size;
    block = sbrk(total_size);
    if (block == (void*) -1) {
        pthread_mutex_unlock(&global_malloc_lock);
        return NULL;
    }
    header = block;
    header->size = size;
    header->is_free = 0;
    header->next = NULL;
    if (!head)
        head = header;
    if (tail)
        tail->next = header;
    tail = header;
    pthread_mutex_unlock(&global_malloc_lock);
    return (void*)(header + 1);
}
void *calloc(size_t num, size_t nsize)
{
    size_t size;
    void *block;
    if (!num || !nsize)
        return NULL;
    size = num * nsize;
    /* check mul overflow */
    if (nsize != size / num)
        return NULL;
    block = malloc(size);
    if (!block)
        return NULL;
    memset(block, 0, size);
    return block;
}
void *realloc(void *block, size_t size)
{
    struct header_t *header;
    void *ret;
    if (!block || !size)
        return malloc(size);
    header = (struct header_t*)block - 1;
    if (header->size >= size)
        return block;
    ret = malloc(size);
    if (ret) {
        /* Relocate contents to the new bigger block */
        memcpy(ret, block, header->size);
        /* Free the old memory block */
        free(block);
    }
    return ret;
}
The problem occurred because the functions were not prototyped (declared).
Once I added the function prototypes, the code worked.
For more information about prototyping:
A mutex variable must be initialized before it is used for locking. Your global_malloc_lock is not initialized.
You can't initialize a mutex variable the way you would a normal variable:
pthread_mutex_t global_malloc_lock = 0; // invalid; you may think that, since it is declared global, it is initialized with 0, but that is not a valid way to initialize a mutex
Initialize the mutex variable by calling pthread_mutex_init() or by using PTHREAD_MUTEX_INITIALIZER.
For your code, add this:
pthread_mutex_t global_malloc_lock = PTHREAD_MUTEX_INITIALIZER;
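A minimal sketch of the two initialization routes mentioned above, in case it helps. The counter and the helper names locked_increment and init_dynamic_mutex are made up for illustration; they stand in for the allocator's shared state and are not part of the question's code.

```cpp
#include <pthread.h>

// Static initialization: suitable for a mutex with static storage duration.
static pthread_mutex_t global_malloc_lock = PTHREAD_MUTEX_INITIALIZER;

// Hypothetical shared state standing in for the allocator's block list.
static int guarded_counter = 0;

// Takes the lock, updates the shared state, releases the lock.
// Returns 0 on success or the pthread error code.
int locked_increment() {
    int rc = pthread_mutex_lock(&global_malloc_lock);
    if (rc != 0)
        return rc;
    ++guarded_counter;
    return pthread_mutex_unlock(&global_malloc_lock);
}

// Dynamic initialization, for a mutex that is not a static object.
// Passing nullptr selects the default mutex attributes.
int init_dynamic_mutex(pthread_mutex_t *m) {
    return pthread_mutex_init(m, nullptr);
}
```

A mutex set up with pthread_mutex_init() should eventually be released with pthread_mutex_destroy(); the statically initialized one does not need that.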

how to implement splice_read for a character device file with uncached DMA buffer

I have a character device driver. It includes a 4MB coherent DMA buffer, which is implemented as a ring buffer. I also implemented the splice_read call for the driver to improve performance, but this implementation does not work well. Below is a usage example:
(1) Splice 16 pages of device buffer data to pipefd[1] (the DMA buffer is managed in page units).
(2) Splice pipefd[0] to the socket.
(3) The receiving side (TCP client) receives the data and then checks its correctness.
I found that the TCP client got errors. The splice_read implementation is shown below (I took it from the vmsplice implementation):
/* splice related functions */
static void rdma_ring_pipe_buf_release(struct pipe_inode_info *pipe,
                                       struct pipe_buffer *buf)
{
    buf->flags &= ~PIPE_BUF_FLAG_LRU;
}

void rdma_ring_spd_release_page(struct splice_pipe_desc *spd, unsigned int i)
{
}

static const struct pipe_buf_operations rdma_ring_page_pipe_buf_ops = {
    .can_merge = 0,
    .map = generic_pipe_buf_map,
    .unmap = generic_pipe_buf_unmap,
    .confirm = generic_pipe_buf_confirm,
    .release = rdma_ring_pipe_buf_release,
    .steal = generic_pipe_buf_steal,
    .get = generic_pipe_buf_get,
};

/* In order to simplify the caller's work, the meanings of the ppos and len
 * parameters have been changed to fit the driver's internal ring buffer. ppos
 * indicates which page is referred to (it should start from 1, as the csr
 * page is not allowed to be spliced); len indicates how many pages are needed.
 * We also constrain the maximum page count for each splice to 16 pages;
 * otherwise -EINVAL is returned. If a high-speed device needs a larger page
 * count, it can rework this routine. The offset (ppos) is also used to return
 * the total bytes that should be transferred; the user can compare it with
 * the return value to determine whether all bytes have been transferred.
 */
static ssize_t do_rdma_ring_splice_read(struct file *in, loff_t *ppos,
                                        struct pipe_inode_info *pipe, size_t len,
                                        unsigned int flags)
{
    struct rdma_ring *priv = to_rdma_ring(in->private_data);
    struct rdma_ring_buf *data_buf;
    struct rdma_ring_dstatus *dsta_buf;
    struct page *pages[PIPE_DEF_BUFFERS];
    struct partial_page partial[PIPE_DEF_BUFFERS];
    ssize_t total_sz = 0, error;
    int i;
    unsigned offset;
    struct splice_pipe_desc spd = {
        .pages = pages,
        .partial = partial,
        .nr_pages_max = PIPE_DEF_BUFFERS,
        .flags = flags,
        .ops = &rdma_ring_page_pipe_buf_ops,
        .spd_release = rdma_ring_spd_release_page,
    };

    /* init the spd; currently we omit the packet header. If a control
     * is needed, it may be implemented by defining a control variable in
     * the device struct */
    spd.nr_pages = len;
    for (i = 0; i < len; i++) {
        offset = (unsigned)(*ppos) + i;
        data_buf = get_buf(priv, offset);
        dsta_buf = get_dsta_buf(priv, offset);
        pages[i] = virt_to_page(data_buf);
        partial[i].offset = 0;
        partial[i].len = dsta_buf->bytes_xferred;
        total_sz += partial[i].len;
    }

    error = _splice_to_pipe(pipe, &spd);
    /* use ppos to return the theoretical total bytes that should be transferred */
    *ppos = total_sz;
    return error;
}

/* splice read */
static ssize_t rdma_ring_splice_read(struct file *in, loff_t *ppos,
                                     struct pipe_inode_info *pipe, size_t len, unsigned int flags)
{
    ssize_t ret;

    MY_PRINT("%s: *ppos = %lld, len = %ld\n", __func__, *ppos, (long)len);
    if (unlikely(len > PIPE_DEF_BUFFERS))
        return -EINVAL;
    ret = do_rdma_ring_splice_read(in, ppos, pipe, len, flags);
    return ret;
}
The _splice_to_pipe function is the same as splice_to_pipe in the kernel. Since that function is not an exported symbol, I re-implemented it.
I think the main cause is that some kind of page locking has been omitted, but I don't know where or how.
My kernel version is 3.10.
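For context, the user-space side of steps (1) and (2) could be sketched roughly as follows, using the standard splice(2) call. This is a generic illustration only: relay_via_pipe is a hypothetical helper, len is a byte count here, and the driver-specific meaning of ppos and len described in the comment above is not modelled.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // splice() is a Linux-specific extension
#endif
#include <fcntl.h>      // splice, SPLICE_F_MOVE
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // pipe, close
#include <cstddef>

// Moves up to len bytes from in_fd to out_fd through a temporary pipe.
// Returns the number of bytes forwarded, or -1 on error.
ssize_t relay_via_pipe(int in_fd, int out_fd, std::size_t len) {
    int pipefd[2];
    if (pipe(pipefd) != 0)
        return -1;

    ssize_t forwarded = -1;
    // Step (1): source -> write end of the pipe.
    ssize_t in = splice(in_fd, nullptr, pipefd[1], nullptr, len, SPLICE_F_MOVE);
    if (in >= 0) {
        forwarded = 0;
        // Step (2): read end of the pipe -> destination. Loop because a
        // single splice() call may move fewer bytes than requested.
        while (forwarded < in) {
            ssize_t out = splice(pipefd[0], nullptr, out_fd, nullptr,
                                 static_cast<std::size_t>(in - forwarded),
                                 SPLICE_F_MOVE);
            if (out <= 0) {
                forwarded = -1;
                break;
            }
            forwarded += out;
        }
    }
    close(pipefd[0]);
    close(pipefd[1]);
    return forwarded;
}
```

Comparing the return value with the number of bytes the device reports is one way to notice a short transfer before the data reaches the receiver.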

cudaMemcpyAsync only one member of a struct from device to host

I have a struct with multiple members, and I want to perform some operations on a subset of those members on GPUs. To keep the amount of data transferred as small as possible, I would like to copy back only the members that have been modified. Can CUDA do that?
struct nodeInfo;
typedef struct nodeInfo {
    int x;
    int y;
} nodeProp;
int main(int argc, char* argv[]){
int ngpus;
cudaStream_t stream[ngpus];
nodeProp *Nodes;
nodeProp *gpuNodes[ngpus];
int rankSize = 10;
int deviceSize = rankSize/ngpus;
for(int i = 0; i < ngpus; i++)
for(int i = 0; i < ngpus; i++)
kernel_x_Operation<<<grid_size,block_size,0,stream[i]>>>(gpuNodes[i]);//Some operation on gpuNodes.x
//How to write the memcpy function? Can I just copy one member of the struct back?
CHECK((void*)cudaMemcpyAsync((Nodes+i*deviceSize)->x, gpuNodes[i]->x), sizeof(int)*deviceSize,cudaMemcpyDeviceToHost,stream[i]));
No, you can't do that. But you can achieve something similar by laying your data out as a Struct of Arrays instead of an Array of Structs.
Have a look at Structure of Arrays vs Array of Structures in cuda to see how this might even improve performance.
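To illustrate why the layout matters, here is a host-only sketch. With a Struct of Arrays, all x values are contiguous, so a single flat copy of count * sizeof(int) bytes moves only x. std::memcpy stands in for the device-to-host copy, and the names NodeAoS, NodesSoA and copy_back_x are invented for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Array of Structs element: x and y are interleaved in memory, so the x
// values of consecutive nodes are not contiguous.
struct NodeAoS {
    int x;
    int y;
};

// Struct of Arrays: all x values are contiguous, and so are all y values.
struct NodesSoA {
    std::vector<int> x;
    std::vector<int> y;
    explicit NodesSoA(std::size_t n) : x(n, 0), y(n, 0) {}
};

// Copies back only the x values of `count` nodes starting at index `first`.
// std::memcpy is a stand-in for a device-to-host copy of the same byte range.
void copy_back_x(NodesSoA &host, const int *modified_x,
                 std::size_t first, std::size_t count) {
    std::memcpy(host.x.data() + first, modified_x, count * sizeof(int));
}
```

In the multi-GPU case from the question, each device would presumably copy its own slice of x back with a different first offset, and y would never be transferred.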

CUDA copy linked lists from device to host

I am trying to populate a number of linked lists on the device and then return those lists to the host.
From my understanding I need to allocate memory for my struct Element, but I don't know how to go about it since I will have many linked lists, each with an unknown number of elements. I've tried a couple of different things but it still didn't work. So I'm back to the starting point. Here is my code:
#include <cstdlib>  // malloc
#include <iostream> // std::cout

class Node{
public:
    int x,y;
    Node *parent;
    __device__ __host__ Node(){}
    __device__ __host__ Node(int cX, int cY){x = cX; y = cY;}
    __device__ __host__ int get_row() { return x; }
    __device__ __host__ int get_col() { return y; }
};

class LinkedList{
public:
    __device__ __host__ struct Element{
        Node n1;
        Element *next;
    };
    __device__ __host__ LinkedList(){
        head = NULL;
    }
    __device__ __host__ void addNode(Node n){
        Element *el = new Element();
        el->n1 = n;
        el->next = head;
        head = el;
    }
    __device__ __host__ Node popFirstNode(){
        Element *cur = head;
        Node n;
        if(cur != NULL){
            n = cur -> n1;
            head = head -> next;
            delete cur;
        }
        return n;
    }
    __device__ __host__ bool isEmpty(){
        Element *cur = head;
        if(cur == NULL){
            return true;
        }
        return false;
    }
    Element *head;
};

__global__ void listsKernel(LinkedList* d_Results, int numLists){
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    Node n(1,1);
    if(idx < numLists){
    }
}

int main(){
    int numLists = 10;
    size_t size = numLists * sizeof(LinkedList);
    LinkedList curList;
    LinkedList* h_Results = (LinkedList*)malloc(size);
    LinkedList* d_Results;
    cudaMalloc((void**)&d_Results, size);
    listsKernel<<<256,256>>>(d_Results, numLists);
    cudaMemcpy(h_Results, d_Results, sizeof(LinkedList)*numLists, cudaMemcpyDeviceToHost);
    for(int i = 0; i < numLists; i++){
        curList = h_Results[i];
        while(curList.isEmpty() == false){
            Node n = curList.popFirstNode();
            std::cout << "x: " << n.get_row() << " y: " << n.get_col();
        }
    }
}
As you can see, I'm trying to populate 10 linked lists on the device and then return them to the host, but the code above results in an unhandled exception: Access violation reading location. I am assuming it is not copying the pointers from the device.
Any help would be great.
Just eyeballing the code, it seems you have a fundamental misconception: there is host memory which cannot be accessed from the device, and device memory which cannot be accessed from the host. So when you create linked list nodes in device memory and copy the pointers back to the host, the host cannot dereference those pointers, because they are pointing to device memory.
If you truly want to pass linked lists back and forth between host and device, your best bet is probably to copy the entries into an array, do the memcpy, then copy the array back into a linked list. Other things can be done too, depending on just what your use case is.
(It is possible to allocate a region of memory that is accessible from both the host and the device, but there is some awkwardness with it and I have no experience using it.)
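The "copy into an array, memcpy, rebuild the list" idea might look roughly like this on the host side. It is a plain C++ sketch with simplified Node and Element types and hypothetical helper names (flatten, rebuild, destroy); the actual device-to-host transfer of the flat array is left out.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins for the question's types.
struct Node {
    int x;
    int y;
};

struct Element {
    Node n1;
    Element *next;
};

// Walks the list and copies each payload into a contiguous array.
// The pointers are dropped, so the result can be moved with one flat copy.
std::vector<Node> flatten(const Element *head) {
    std::vector<Node> out;
    for (const Element *cur = head; cur != nullptr; cur = cur->next)
        out.push_back(cur->n1);
    return out;
}

// Rebuilds a list with the same order from the flat array.
// The caller owns the returned nodes and releases them with destroy().
Element *rebuild(const std::vector<Node> &flat) {
    Element *head = nullptr;
    for (std::size_t i = flat.size(); i-- > 0;)
        head = new Element{flat[i], head};
    return head;
}

void destroy(Element *head) {
    while (head != nullptr) {
        Element *next = head->next;
        delete head;
        head = next;
    }
}
```

Because the flat array contains no pointers, nothing in it refers to device memory once it has been copied to the host.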
