NtQueryObject returns wrong insufficient required size via WOW64, why?

NtQueryObject returns wrong insufficient required size via WOW64, why? - windows-10

I am using the NT native API NtQueryObject()/ZwQueryObject() from user mode (and I am aware of the risks in general and I have written kernel mode drivers for Windows in the past in my professional capacity).
Generally when one uses the typical "query information" function (of which there are a few) the protocol is first to ask with a too small buffer to retrieve the required size with STATUS_INFO_LENGTH_MISMATCH, then allocate a buffer of said size and query again -- this time using the buffer and previously returned size.
In order to get the list of object types (67 on my build) on the system I am doing just that:
ULONG Size = 0;
NTSTATUS Status = NtQueryObject(NULL, ObjectTypesInformation, &Size, sizeof(Size), &Size);
And in Size I get 8280 (WOW64) and 8968 (x64). I then proceed to allocate the buffer with calloc() and query again:
ULONG Size2 = 0;
BYTE* Buf = (BYTE*)::calloc(1, Size);
Status = NtQueryObject(NULL, ObjectTypesInformation, Buf, Size, &Size2);
NB: ObjectTypesInformation is 3. It isn't declared in winternl.h, but Nebbett (as ObjectAllTypesInformation) and others describe it. Since I am not querying for a particular object's traits but the system-wide list of object types, I pass NULL for the object handle.
Curiously on WOW64, i.e. 32-bit, the value in Size2 upon return from the second query is 16 Bytes (= 8296) bigger than the previously returned required size.
As far as alignment is concerned, I'd expect at most 8 Bytes for this sort of thing and indeed neither 8280 nor 8296 are at a 16 Byte alignment boundary, but on an 8 Byte one.
Certainly I can add some slack space on top of the returned required size (e.g. ALIGN_UP to the next 32 Byte alignment boundary), but this seems highly irregular to be honest. And I'd rather want to understand what's going on than to implement a workaround that breaks, because I miss something crucial.
The practical issue for the code is that in Debug configurations it tells me there's a corrupted heap somewhere, upon freeing Buf. Which suggests that NtQueryObject() was indeed writing these extra 16 Bytes beyond the buffer I provided.
Question: Any idea why it is doing that?
As usual for NT native API the sources of information are scarce. The x64 version of the exact same code returns the exact number of bytes required. So my thinking here is that WOW64 is the issue. A somewhat cursory look into wow64.dll with IDA didn't reveal any immediate points for suspicion regarding what goes wrong in translating the results to 32-bit here.
PS: Windows 10 (10.0.19043, ntdll.dll "timestamp" 77755782)
PPS: this may be related: https://wj32.org/wp/2012/11/30/obquerytypeinfo-and-ntqueryobject-buffer-overrun-in-windows-8/ Tested it, by checking that OBJECT_TYPE_INFORMATION::TypeName.Length + sizeof(WCHAR) == OBJECT_TYPE_INFORMATION::TypeName.MaximumLength in all returned items, which was the case.

The only part of ObjectTypesInformation that's public is the first field defined in winternl.h header in the Windows SDK:
typedef struct __PUBLIC_OBJECT_TYPE_INFORMATION {
UNICODE_STRING TypeName;
ULONG Reserved [22]; // reserved for internal use
} PUBLIC_OBJECT_TYPE_INFORMATION, *PPUBLIC_OBJECT_TYPE_INFORMATION;
For x86 this is 96 bytes, and for x64 this is 104 bytes (assuming you have the right packing mode enabled). The difference is the pointer in UNICODE_STRING which changes the alignment in x64.
Any additional memory space should be related to the TypeName buffer.
UNICODE_STRING accounts for 8 bytes of the difference between 8280 and 8296. The function uses the sizeof(ULONG_PTR) for alignment of the returned string plus an extra WCHAR, so that could easily account for the remaining 8 bytes.
AFAIK: The public use of NtQueryObject is supposed to be limited to kernel-mode use which of course means it always matches the OS native bitness (x86 code can't run as kernel in x64 native OS), so it's probably just a quirk of using the NT functions via the WOW64 thunk.

Alright, I think I figured out the issue with the help of WinDbg and a thorough look at wow64.dll using IDA.
NB: the wow64.dll I have has the same build number, but differs slightly in data only (checksum, security directory entry, pieces from version resources). The code is identical, which was to be expected, given deterministic builds and how they affect the PE timestamp.
There's an internal function called whNtQueryObject_SpecialQueryCase (according to PDBs), which covers the ObjectTypesInformation class queries.
For the above wow64.dll I used the following points of interest in WinDbg, from a 32 bit program which calls NtQueryObject(NULL, ObjectTypesInformation, ...) (the program itself is irrelevant, though):
0:000> .load wow64exts
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B0E0
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B14E
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B1A7
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B24A
0:000> bp wow64!whNtQueryObject_SpecialQueryCase+B252
Explanation of the above points of interest:
+B0E0: computing length required for 64 bit query, based on passed length for 32 bit
+B14E: call to NtQueryObject()
+B1A7: loop body for copying 64 to 32 bit buffer contents, after successful NtQueryObject() call
+B24A: computing written length by subtracting current (last + 1) entry from base buffer address
+B252: downsizing returned (64 bit) required length to 32 bit
The logic of this function in regards to just ObjectTypesInformation is roughly as follows:
Common steps
Take the ObjectInformationLength (32 bit query!) argument and size it up to fit the 64 bit info
Align the retrieved size up to the next 16 byte boundary
If necessary allocate the resulting amount from some PEB::ProcessHeap and store in TLS slot 3; otherwise using this as a scratch space
Call NtQueryObject() passing the buffer and length from the two previous steps
The length passed to NtQueryObject() is the one from step 1, not the one aligned to a 16 byte boundary. There seems to be some sort of header to this scratch space, so perhaps that's where the 16 byte alignment comes from?
Case 1: buffer size too small (here: 4), just querying required length
The up-sized length in this case equals 4, which is too small and consequently NtQueryObject() returns STATUS_INFO_LENGTH_MISMATCH. Required size is reported as 8968.
Down-size from the 64 bit required length to 32 bit and end up 16 bytes too short
Return the status from NtQueryObject() and the down-sized required length form the previous step
Case 2: buffer size supposedly (!) sufficient
Copy OBJECT_TYPES_INFORMATION::NumberOfTypes from queried buffer to 32 bit one
Step to the first entry (OBJECT_TYPE_INFORMATION) of source (64 bit) and target (32 bit) buffer, 8 and 4 byte aligned respectively
For for each entry up to OBJECT_TYPES_INFORMATION::NumberOfTypes:
Copy UNICODE_STRING::Length and UNICODE_STRING::MaximumLength for TypeName member
memcpy() UNICODE_STRING::Length bytes from source to target UNICODE_STRING::Buffer (target entry + sizeof(OBJECT_TYPE_INFORMATION32)
Add terminating zero (WCHAR) past the memcpy'd string
Copy the individual members past the TypeName from 64 to 32 bit struct
Compute pointer of next entry by aligning UNICODE_STRING::MaximumLength up to an 8 byte boundary (i.e. the ULONG_PTR alignment mentioned in the other answer) + sizeof(OBJECT_TYPE_INFORMATION64) (already 8 byte aligned!)
The next target entry (32 bit) gets 4 byte aligned instead
At the end compute required (32 bit) length by subtracting the value we arrived at for the "next" entry (i.e. one past the last) from the base address of the buffer passed by the WOW64 program (32 bit) to NtQueryObject()
In my debugged scenario these were: 0x008ce050 - 0x008cbfe8 = 0x00002068 (= 8296), which is 16 bytes larger than the buffer length we were told during case 1 (8280)!
The issue
That crucial last step differs between merely querying and actually getting the buffer filled. There is no further bounds checking in that loop I described for case 2.
And this means it will just overrun the passed buffer and return a written length bigger than the buffer length passed to it.
Possible solutions and workarounds
I'll have to approach this mathematically after some sleep, the workaround is obviously to top up the required length returned from case 1 in order to avoid the buffer overrun. The easiest method is to use my up_size_from_32bit() from the example below and use that on the returned required size. This way you are allocating enough for the 64 bit buffer, while querying the 32 bit one. This should never overrun during the copy loop.
However, the fix in wow64.dll is a little more involved, I guess. While adding bounds checking to the loop would help avert the overrun, it would mean that the caller would have to query for the required size twice, because the first time around it lies to us.
Which means the query-only case (1) would have to allocate that internal buffer after querying the required length for 64 bit, then get it filled and then walk the entries (just like the copy loop), skipping over the last entry to compute the required length the same as it is now done after the copy loop.
Example program demonstrating the "static" computation by wow64.dll
Build for x64, just the way wow64.dll was!
#define WIN32_LEAN_AND_MEAN
#include <Windows.h>
#include <cstdio>
typedef struct
{
ULONG JustPretending[24];
} OBJECT_TYPE_INFORMATION32;
typedef struct
{
ULONG JustPretending[26];
} OBJECT_TYPE_INFORMATION64;
constexpr ULONG size_delta_3264 = sizeof(OBJECT_TYPE_INFORMATION64) - sizeof(OBJECT_TYPE_INFORMATION32);
constexpr ULONG down_size_to_32bit(ULONG len)
{
return len - size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION64));
}
constexpr ULONG up_size_from_32bit(ULONG len)
{
return len + size_delta_3264 * ((len - 4) / sizeof(OBJECT_TYPE_INFORMATION32));
}
// Trying to mimic the wdm.h macro
constexpr size_t align_up_by(size_t address, size_t alignment)
{
return (address + (alignment - 1)) & ~(alignment - 1);
}
constexpr auto u32 = 8280UL;
constexpr auto u64 = 8968UL;
constexpr auto from_64 = down_size_to_32bit(u64);
constexpr auto from_32 = up_size_from_32bit(u32);
constexpr auto from_32_16_byte_aligned = (ULONG)align_up_by(from_32, 16);
int wmain()
{
wprintf(L"32 to 64 bit: %u -> %u -(16-byte-align)-> %u\n", u32, from_32, from_32_16_byte_aligned);
wprintf(L"64 to 32 bit: %u -> %u\n", u64, from_64);
return 0;
}
static_assert(sizeof(OBJECT_TYPE_INFORMATION32) == 96, "Size for 64 bit struct does not match.");
static_assert(sizeof(OBJECT_TYPE_INFORMATION64) == 104, "Size for 64 bit struct does not match.");
static_assert(u32 == from_64, "Must match (from 64 to 32 bit)");
static_assert(u64 == from_32, "Must match (from 32 to 64 bit)");
static_assert(from_32_16_byte_aligned % 16 == 0, "16 byte alignment failed");
static_assert(from_32_16_byte_aligned > from_32, "We're aligning up");
This does not mimic the computation that happens in case 2, though.

Related

How to get the location in the compressed data - zlib

When I save the 'state of uncompression', I also need to save:
"location in the compressed data, which is both a byte offset and bit offset within that byte".
After a reboot, along with inflateSetDictionary(), I call inflatePrime() as below, "to feed the bits from the byte at the compressed data offset".
inflatePrime ( , streamBits, streamCurrentPos)
Both APIs return Z_OK, but params to inflatePrime(), I am bit uncertain.
This is how I gathered them:
typedef struct state_of_uncompression
{
uInt streamCurrentPos; // Missing this, tried the output from unzGetCurrentFileZStreamPos64()
int streamBits; // from : stream.data_type, after clearing bits 8,7,6: stream.data_type & (~0x1C0)
Byte dictionary_buf[32768]; // from : inflateGetDictionary()
uInt dictLength; // from : inflateGetDictionary();
uint64_t output_wrt_offset // got this already.
} uncompression_state_info;
So after the reboot, the plan is to recontinue the uncompression, but inflate() returns Z_STREAM_END inside unzReadCurrentFile(), as if, inflate() doesn't know where to restart from.
Thanks appreciate any feedback.

The third argument to inflatePrime() is not a position. It is the actual bits to insert, which you need to get from the compressed data. You use fseek() or lseek() to go to the byte offset in the file, where you saved that offset as part of your entry point information. You get that byte, which advances the file pointer to the next byte, and shift down by the number of bits you are not providing, i.e. 8 minus the second argument. That's the third argument. The second argument is always in 1..7. If there are no bits to insert, then you don't call inflatePrime(), and just leave the file pointer where it is to begin inflating.
The position in your state should be a 64-bit value, not a 32-bit value as you currently have it.

Different memory allocations in linux and windows?

I have a tree (T*Tree: binary tree with many elements in the node) implemented in C++.
I want to insert around 5,000,000 integer values in it (let's say from 1 till 5,000,000). The tree size should be around 8 * 5,000,000 byte or 41MB in memory (according to my implementation which is reasonable).
When I display the size of the tree(in my program by calculating the size of every node), it is 41MB as normal. However when I checked in Windows 32bit>>"Task Manager" I found the memory taken is 732MB!!
I checked that there is no extra malloc in my code. Even after I freed the tree by traversing from node to node and deleting them(and the keys inside) the size in "Task Manager" becomes 513MB only!!
After that I compiled same code in Linux Ubuntu 32bit(virtual machine on another PC) and ran the program. Again tree size does not change in my program i.e. 41MB as normal, but in "System Monitor" memory is 230MB and when freeing the tree nodes in my program the memory in "System Monitor" remains same 230MB.
And in both Windows and Linux if I freed & reinitialized the tree and insert again 5,000,0000 integer values, the memory is increased by double like if the previous space is not freed and used somewhere (which I am not able to find where).
The question:
1) why are those huge memory differences in Windows and Linux although the code & input data is same?
2) why freeing the Tree nodes doesn't reduce the memory to some reasonable value like 10MB.
code: https://drive.google.com/open?id=0ByKaCojxzNa9dEt6cEJNeDI4eXc
below are some snippets:
typedef struct Keylist {
unsigned int k;
struct Keylist *next_ptr;
};
typedef struct Keylist Keylist;
typedef struct TstarTreeNode {
//Binary Node specific
struct TstarTreeNode *left;
struct TstarTreeNode *right;
//Bool rightVisitedDuringInsert;
//AVL Node specific
int height;
//T Node specific
int length; //length of keys array for easy locating
struct Keylist *keys; //later you deal with it like one dimentional array
int max; //max key
int min; //min key
//T* Node specific
struct TstarTreeNode *successor;
};
typedef struct TstarTreeNode TstarTreeNode;
/*****************************************************************************
* *
* Define a structure for binary trees. *
* *
*****************************************************************************/
typedef struct TstarTree {
int size; //number of element(not number of nodes) in a tree
int MinCount; //Min Count of elements in a Node
int MaxCount; //Max Count of elements in a Node
TstarTreeNode *root;
//Provide functions for comarison elements and destroying elements
int (*compare)(int key1, int key2); //// -1 smaller, 0 equal, 1 bigger
int (*inRange)(int key, int min, int max); // -1 smaller, 0 in range, 1 bigger
} ;
typedef struct TstarTree TstarTree;
Insert function of the tree uses dynamic allocation i.e. malloc.
Update
according to what "John Zwinck" pointed out (thanks John), I have two things now:
1) The huge memory taken in Windows was because of the compiling options in Visual Studio, which I think enabled debugging and a lot of extra things. When I compiled in Windows using Cygwin without that options i.e. "gcc main.c tstarTree.c -o main" I got same result as in Linux. The size now in Windows>>"Task Manager" becomes 230MB
2) If OS is 64bit then let's see how the size is calculated (as John said and as I modified):
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). I think for 64bit OS it is 32 bytes each (according to John provided link). so 160 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (I think, 32 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16 so I assume the number of nodes are 2^17-1. so the last two items become 6.2MB(i.e. 2^17 * 48) + 4.1MB(i.e. 2^17 * 32) =10MB
So the total is: 20+20+40+160+10= 250MB which is somehow reasonable and close to 230MB.
However I have Windows/Linux 32bit it will be (I think):
5 million unsigned int k. 20 MB.
5 million 4-byte next_ptr. 20 MB.
5 million times the overhead of malloc(). I think for 32bit OS it is 16 bytes each. so 80 MB.
N TstarTreeNodes, each of which is 32 bytes in the full code.
N times the overhead of malloc() (I think, 16 bytes each).
N is the number of nodes. I have a resulting balanced complete tree of height 16 so I assume the number of nodes are 2^17-1. so the last two items become 4.1MB(i.e. 2^17 * 32) + 2MB(i.e. 2^17 * 16) =6MB
So the total is: 20+20+80+6= 126MB it is a little far from 230MB which I get in "Task Manager" (if you know why please tell me?)
Currently the remaining important question is, why isn't the tree freed from memory when I am freeing all the nodes and keys in the tree using this code:
void freekeys(struct Keylist ** keys){
if ((*keys) == NULL)
{
return;
}
freekeys(&(*keys)->next_ptr);
(*keys)->next_ptr = NULL;
free((*keys));
(*keys) = NULL;
}
void freeTree(struct TstarTreeNode ** tree){
if ((*tree) == NULL)
{
return;
}
freeTree(&(*tree)->left);
freeTree(&(*tree)->right);
freekeys(&(*tree)->keys);
(*tree)->keys = NULL;
(*tree)->left = NULL;
(*tree)->right = NULL;
(*tree)->successor = NULL;
free((*tree));
(*tree) = NULL;
}
and in main():
TstarTree * tree;
...
freeTree(&tree->root);
free(tree);
Note:
The tree is working perfectly (insert, update, delete, lookup, display...) but when trying to free the tree from memory nothing changed in its size

You say your data takes:
8 * 5,000,000 byte or 41MB in memory
But that is not correct. Looking at your code there are two main structures:
struct Keylist {
unsigned int k;
Keylist *next_ptr;
};
struct TstarTreeNode {
TstarTreeNode *left, *right;
Keylist *keys;
TstarTreeNode *successor;
};
Let's say we have 5 million integers to store, as in your example. What will we need?
5 million unsigned int k. 20 MB.
5 million 4-byte pads (after k to align next_ptr). 20 MB.
5 million 8-byte next_ptr. 40 MB.
5 million times the overhead of malloc(). Likely 16 bytes each. 80 MB.
N TstarTreeNodes, each of which is 48 bytes in the full code.
N times the overhead of malloc() (again, 16 bytes each).
If N is 500,000 (for example, I don't know the real value but you do), those last two items add up to 32 MB. That brings the total to at least 192 MB as a bare minimum. Therefore, seeing 230 MB of memory usage in Linux is not surprising.
Some systems, especially when optimization is not fully enabled at build time, will add more bookkeeping and debugging information to each block allocated with malloc(). Are you building with optimization fully enabled?
One way you can save a lot of overhead is to stop using Keylist and just store the integers in plain arrays (created with malloc(), but only one per TstarTreeNode).

Atomicity of a read on SPARC

I'm writing a multithreaded application and having a problem on the SPARC platform. Ultimately my question comes down to atomicity of this platform and how I could be obtaining this result.
Some pseudocode to help clarify my question:
// Global variable
typdef struct pkd_struct{
uint16_t a;
uint16_t b;
} __attribute__(packed) pkd_struct_t;
pkd_struct_t shared;
Thread 1:
swap_value() {
pkd_struct_t prev = shared;
printf("%d%d\n", prev.a, prev.b);
...
}
Thread 2:
use_value() {
pkd_struct_t next;
next.a = 0; next.b = 0;
shared = next;
printf("%d%d\n", shared.a, shared.b);
...
}
Thread 1 and 2 are accessing the shared variable "shared". One is setting, the other is getting. If Thread 2 is setting "shared" to zero, I'd expect Thread 1 to read count either before OR after the setting -- since "shared" is aligned on a 4-byte boundary. However, I will occasionally see Thread 1 reading the value of the form 0xFFFFFF00. That is the high-order 24 bits are OLD, but the low-order byte is NEW. It appears I've gotten an intermediate value.
Looking at the disassembly, the use_value function simply does an "ST" instruction. Given that the data is aligned and isn't crossing a word boundary, is there any explanation for this behavior? If ST is indeed NOT atomic to use this way, does this explain the result I see (only 1 byte changed?!?)? There is no problem on x86.
UPDATE 1:
I've found the problem, but not the cause. GCC appears to be generating assembly that reads the shared variably byte-by-byte (thus allowing a partial update to be obtained). Comments added, but I am not terribly comfortable with SPARC assembly. %i0 is a pointer to the shared variable.
xxx+0xc: ldub [%i0], %g1 // ld unsigned byte g1 = [i0] -- 0 padded
xxx+0x10: ...
xxx+0x14: ldub [%i0 + 0x1], %g5 // ld unsigned byte g5 = [i0+1] -- 0 padded
xxx+0x18: sllx %g1, 0x18, %g1 // g1 = [i0+0] left shifted by 24
xxx+0x1c: ldub [%i0 + 0x2], %g4 // ld unsigned byte g4 = [i0+2] -- 0 padded
xxx+0x20: sllx %g5, 0x10, %g5 // g5 = [i0+1] left shifted by 16
xxx+0x24: or %g5, %g1, %g5 // g5 = g5 OR g1
xxx+0x28: sllx %g4, 0x8, %g4 // g4 = [i0+2] left shifted by 8
xxx+0x2c: or %g4, %g5, %g4 // g4 = g4 OR g5
xxx+0x30: ldub [%i0 + 0x3], %g1 // ld unsigned byte g1 = [i0+3] -- 0 padded
xxx+0x34: or %g1, %g4, %g1 // g1 = g4 OR g1
xxx+0x38: ...
xxx+0x3c: st %g1, [%fp + 0x7df] // store g1 on the stack
Any idea why GCC is generating code like this?
UPDATE 2: Adding more info to the example code. Appologies -- I'm working with a mix of new and legacy code and it's difficult to separate what's relevant. Also, I understand sharing a variable like this is highly-discouraged in general. However, this is actually in a lock implementation where higher-level code will be using this to provide atomicity and using pthreads or platform-specific locking is not an option for this.

Because you've declared the type as packed, it gets one byte alignment, which means it must be read and written one byte at a time, as SPARC does not allow unaligned loads/stores. You need to give it 4-byte alignment if you want the compiler to use word load/store instructions:
typdef struct pkd_struct {
uint16_t a;
uint16_t b;
} __attribute__((packed, aligned(4))) pkd_struct_t;
Note that packed is essentially meaningless for this struct, so you could leave that out.

Answering my own question here -- this has bugged me for too long and hopefully I can save someone a bit of frustration at some point.
The problem is that although the shared data is aligned, because it is packed GCC reads it byte-by-byte.
There is some discussion here on how packing leading to load/store bloat on SPARC (and other RISC platforms I'd assume...), but in my case it has lead to a race.

What is openCL equivalent for this cuda "cudaMallocPitch "code.?

My PC has an AMD processor with an ATI 3200 GPU which doesn't support OpenCL. The rest of the codes all running by "Falling back to CPU itself".
I am converting one of the code from CUDA to OpenCL but stuck in some particular part for which there is no exact conversion code in OpenCL. since i have less experience in OpenCL I can't make out this, please suggest me some solution if any of you think will work,
The CUDA code is,
size_t pitch = 0;
cudaError error = cudaMallocPitch((void**)&gpu_data, (size_t*)&pitch,
instances->cols * sizeof(float), instances->rows);
for( int i = 0; i < instances->rows; i++ ){
error = cudaMemcpy((void*)(gpu_data + (pitch/sizeof(float))*i),
(void*)(instances->data + (instances->cols*i)),
instances->cols * sizeof(float) ,cudaMemcpyHostToDevice);
If I remove the pitch value from the above I end up with an problem which doesn't write to the device memory "gpu_data".
Somebody please convert this code to OpenCL and reply. I have converted it to OpenCL, but its not working and the data is not written to "gpu_data". My converted OpenCL code is
gpu_data = clCreateBuffer(context, CL_MEM_READ_WRITE, ((instances->cols)*(instances->rows))*sizeof(float), NULL, &ret);
for( int i = 0; i < instances->rows; i++ ){
ret = clEnqueueWriteBuffer(command_queue, gpu_data, CL_TRUE, 0, ((instances->cols)*(instances->rows))*sizeof(float),(void*)(instances->data + (instances->cols*i)) , 0, NULL, NULL);
Sometimes it runs well for this code and gets stuck in the reading part i.e.
ret = clEnqueueReadBuffer(command_queue, gpu_data, CL_TRUE, 0,sizeof( float ) * instances->cols* 1 , instances->data, 0, NULL, NULL);
overhere. And it gives error like
Unhandled exception at 0x10001098 in CL_kmeans.exe: 0xC000001D: Illegal Instruction.
when break is pressed , it gives:
No symbols are loaded for any call stack frame. The source code cannot be displayed.
while debugging. In the call stack it is displaying:
OCL8CA9.tmp.dll!10001098()
[Frames below may be incorrect and/or missing, no symbols loaded for OCL8CA9.tmp.dll]
amdocl.dll!5c39de16()
I really dont know what it means. someone please help me to rid of this problem.

First of all, in the CUDA code you're doing a horribly inefficient thing to copy the data. The CUDA runtime has the function cudaMemcpy2D that does exactly what you are trying to do by looping over different rows.
What cudaMallocPitch does is to compute an optimal pitch (= distance in byte between rows in a 2D array) such that each new row begins at an address that is optimal for coalescing, and then allocates a memory area as large as pitch times the number of rows you specify. You can emulate the same thing in OpenCL by first computing the optimal pitch and then doing the allocation of the correct size.
The optimal pitch is computed by (1) getting the base address alignment preference for your card (CL_DEVICE_MEM_BASE_ADDR_ALIGN property with clGetDeviceInfo: note that the returned value is in bits, so you have to divide by 8 to get it in bytes); let's call this base (2) find the largest multiple of base that is no less than your natural data pitch (sizeof(type) times number of columns); this will be your pitch.
You then allocate pitch times number of rows bytes, and pass the pitch information to kernels.
Also, when copying data from the host to the device and converesely, you want to use clEnqueue{Read,Write}BufferRect, that are specifically designed to copy 2D data (they are the counterparts to cudaMemcpy2D).

Is it possible to store pointers in shared memory without using offsets?

When using shared memory, each process may mmap the shared region into a different area of its respective address space. This means that when storing pointers within the shared region, you need to store them as offsets of the start of the shared region. Unfortunately, this complicates use of atomic instructions (e.g. if you're trying to write a lock free algorithm). For example, say you have a bunch of reference counted nodes in shared memory, created by a single writer. The writer periodically atomically updates a pointer 'p' to point to a valid node with positive reference count. Readers want to atomically write to 'p' because it points to the beginning of a node (a struct) whose first element is a reference count. Since p always points to a valid node, incrementing the ref count is safe, and makes it safe to dereference 'p' and access other members. However, this all only works when everything is in the same address space. If the nodes and the 'p' pointer are stored in shared memory, then clients suffer a race condition:
x = read p
y = x + offset
Increment refcount at y
During step 2, p may change and x may no longer point to a valid node. The only workaround I can think of is somehow forcing all processes to agree on where to map the shared memory, so that real pointers rather than offsets can be stored in the mmap'd region. Is there any way to do that? I see MAP_FIXED in the mmap documentation, but I don't know how I could pick an address that would be safe.
Edit: Using inline assembly and the 'lock' prefix on x86 maybe it's possible to build a "increment ptr X with offset Y by value Z"? Equivalent options on other architectures? Haven't written a lot of assembly, don't know if the needed instructions exist.

On low level the x86 atomic inctruction can do all this tree steps at once:
x = read p
y = x + offset Increment
refcount at y
//
mov edi, Destination
mov edx, DataOffset
mov ecx, NewData
#Repeat:
mov eax, [edi + edx] //load OldData
//Here you can also increment eax and save to [edi + edx]
lock cmpxchg dword ptr [edi + edx], ecx
jnz #Repeat
//

This is trivial on a UNIX system; just use the shared memory functions:
shgmet, shmat, shmctl, shmdt
void *shmat(int shmid, const void *shmaddr, int shmflg);
shmat() attaches the shared memory
segment identified by shmid to the
address space of the calling process.
The attaching address is specified by
shmaddr with one of the following
criteria:
If shmaddr is NULL, the system chooses
a suitable (unused) address at which
to attach the segment.
Just specify your own address here; e.g. 0x20000000000
If you shmget() using the same key and size in every process, you will get the same shared memory segment. If you shmat() at the same address, the virtual addresses will be the same in all processes. The kernel doesn't care what address range you use, as long as it doesn't conflict with wherever it normally assigns things. (If you leave out the address, you can see the general region that it likes to put things; also, check addresses on the stack and returned from malloc() / new[] .)
On Linux, make sure root sets SHMMAX in /proc/sys/kernel/shmmax to a large enough number to accommodate your shared memory segments (default is 32MB).
As for atomic operations, you can get them all from the Linux kernel source, e.g.
include/asm-x86/atomic_64.h
/*
* Make sure gcc doesn't try to be clever and move things around
* on us. We need to use _exactly_ the address the user gave us,
* not some alias that contains the same information.
*/
typedef struct {
int counter;
} atomic_t;
/**
* atomic_read - read atomic variable
* #v: pointer of type atomic_t
*
* Atomically reads the value of #v.
*/
#define atomic_read(v) ((v)->counter)
/**
* atomic_set - set atomic variable
* #v: pointer of type atomic_t
* #i: required value
*
* Atomically sets the value of #v to #i.
*/
#define atomic_set(v, i) (((v)->counter) = (i))
/**
* atomic_add - add integer to atomic variable
* #i: integer value to add
* #v: pointer of type atomic_t
*
* Atomically adds #i to #v.
*/
static inline void atomic_add(int i, atomic_t *v)
{
asm volatile(LOCK_PREFIX "addl %1,%0"
: "=m" (v->counter)
: "ir" (i), "m" (v->counter));
}
64-bit version:
typedef struct {
long counter;
} atomic64_t;
/**
* atomic64_add - add integer to atomic64 variable
* #i: integer value to add
* #v: pointer to type atomic64_t
*
* Atomically adds #i to #v.
*/
static inline void atomic64_add(long i, atomic64_t *v)
{
asm volatile(LOCK_PREFIX "addq %1,%0"
: "=m" (v->counter)
: "er" (i), "m" (v->counter));
}

We have code that's similar to your problem description. We use a memory-mapped file, offsets, and file locking. We haven't found an alternative.

You shouldn't be afraid to make up an address at random, because the kernel will just reject addresses it doesn't like (ones that conflict). See my shmat() answer above, using 0x20000000000
With mmap:
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
If addr is not NULL, then the kernel
takes it as a hint about where to
place the mapping; on Linux, the
mapping will be created at the next
higher page boundary. The address of
the new mapping is returned as the
result of the call.
The flags argument determines whether
updates to the mapping are visible to
other processes mapping the same
region, and whether updates are
carried through to the underlying
file. This behavior is determined by
including exactly one of the following
values in flags:
MAP_SHARED Share this mapping.
Updates to the mapping are visible to
other processes that map this
file, and are carried through to the
underlying file. The file may not
actually be updated until msync(2) or
munmap() is called.
ERRORS
EINVAL We don’t like addr, length, or
offset (e.g., they are too large, or
not aligned on a page boundary).

Adding the offset to the pointer does not create the potential for a race, it already exists. Since at least neither ARM nor x86 can atomically read a pointer then access the memory it refers to you need to protect the pointer access with a lock regardless of whether you add an offset.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string