The measured mcycle is less than minstret in Spike - riscv

I use spike to run the test program in "riscv-tools/riscv-tests/build/benchmarks":
$ spike multiply.riscv
And the output shows:
mcycle = 24096
minstret = 24103
Why is mcycle less than minstret?
Does it mean that Spike can run more than one instruction per cycle?
(I tried to trace the Spike code but cannot find how mcycle is counted.)

The printing of the mcycle and minstret values is not done by Spike in this case; it comes from the test (benchmark) itself. Here is the relevant code:
https://github.com/ucb-bar/riscv-benchmarks/blob/master/common/syscalls.c
#define NUM_COUNTERS 2
static uintptr_t counters[NUM_COUNTERS];
static char* counter_names[NUM_COUNTERS];

static int handle_stats(int enable)
{
  int i = 0;
#define READ_CTR(name) do { \
    while (i >= NUM_COUNTERS) ; \
    uintptr_t csr = read_csr(name); \
    if (!enable) { csr -= counters[i]; counter_names[i] = #name; } \
    counters[i++] = csr; \
  } while (0)
  READ_CTR(mcycle);
  READ_CTR(minstret);
#undef READ_CTR
  return 0;
}
Some code executes between the read of mcycle and the read of minstret, and the snippet above shows exactly how much code sits between the two readings.
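For illustration, here is a by-hand expansion of the two READ_CTR invocations (derived from the macro above; just a sketch). Every statement between the two read_csr() calls retires instructions of its own, which plausibly accounts for the difference of 7 between the printed values:
do {
    while (i >= NUM_COUNTERS) ;
    uintptr_t csr = read_csr(mcycle);        // first reading
    if (!enable) { csr -= counters[i]; counter_names[i] = "mcycle"; }
    counters[i++] = csr;
} while (0);
do {
    while (i >= NUM_COUNTERS) ;
    uintptr_t csr = read_csr(minstret);      // second reading, several retired instructions later
    if (!enable) { csr -= counters[i]; counter_names[i] = "minstret"; }
    counters[i++] = csr;
} while (0);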
In Spike, mcycle and minstret are always equal by definition (they are handled by the same code): https://github.com/riscv/riscv-isa-sim/blob/9e012462f53113dc9ed00d7fbb89aeafeb9b89e9/riscv/processor.cc#L347
case CSR_MINSTRET:
case CSR_MCYCLE:
  if (xlen == 32)
    state.minstret = (state.minstret >> 32 << 32) | (val & 0xffffffffU);
  else
    state.minstret = val;
  break;
syscalls.c was linked into the multiply.riscv binary by https://github.com/ucb-bar/riscv-benchmarks/blob/master/multiply/bmark.mk:
multiply_riscv_bin = multiply.riscv
$(multiply_riscv_bin): ... $(patsubst %.c, %.o, ... syscalls.c ... )
syscalls.c also defines an _init function, which calls the test's main and then prints the values that were recorded by handle_stats during the SYS_stats "syscall".
void _init(int cid, int nc)
{
  init_tls();
  thread_entry(cid, nc);

  // only single-threaded programs should ever get here.
  int ret = main(0, 0);

  char buf[NUM_COUNTERS * 32] __attribute__((aligned(64)));
  char* pbuf = buf;
  for (int i = 0; i < NUM_COUNTERS; i++)
    if (counters[i])
      pbuf += sprintf(pbuf, "%s = %d\n", counter_names[i], counters[i]);
  if (pbuf != buf)
    printstr(buf);

  exit(ret);
}


MPI4PY: ring communication with neighbor_alltoallw

Please Help!
I am using MPI (Message Passing Interface) in Python for a ring communication, which means that every rank sends to and receives from its neighbours. I know one way to realize this is by using, for instance, MPI.COMM_WORLD.issend() and MPI.COMM_WORLD.recv(); this is working and done.
Now I want to produce the same output in a different way, using MPI.Topocomm.Neighbor_alltoallw, but this is not working. I wrote a C code and it works there, so the same output can be reached with this function, but when I implement it in Python it does not work. Please find below the C code and the Python code.
The definition of the function (from the mpi4py package for Python) says:
Neighbor_alltoallw(...)
Topocomm.Neighbor_alltoallw(self, sendbuf, recvbuf)
Neighbor All-to-All Generalized
I do not understand the following things:
why is recvbuf not a return value? It seems to be an argument here
how can this be implemented for a ring communication in Python?
Thank you for your time and support!
My working C code:
#include <stdio.h>
#include <mpi.h>
#define to_right 201
#define max_dims 1
int main (int argc, char *argv[])
{
int my_rank, size;
int snd_buf, rcv_buf;
int right, left;
int sum, i;
MPI_Comm new_comm;
int dims[max_dims],
periods[max_dims],
reorder;
MPI_Aint snd_displs[2], rcv_displs[2];
int snd_counts[2], rcv_counts[2];
MPI_Datatype snd_types[2], rcv_types[2];
MPI_Status status;
MPI_Request request;
MPI_Init(&argc, &argv);
/* Get process info. */
MPI_Comm_size(MPI_COMM_WORLD, &size);
/* Set cartesian topology. */
dims[0] = size;
periods[0] = 1;
reorder = 1;
MPI_Cart_create(MPI_COMM_WORLD, max_dims, dims, periods,
reorder,&new_comm);
/* Get coords */
MPI_Comm_rank(new_comm, &my_rank);
/* MPI_Cart_coords(new_comm, my_rank, max_dims, my_coords); */
/* Get nearest neighbour rank. */
MPI_Cart_shift(new_comm, 0, 1, &left, &right);
/* Compute global sum. */
sum = 0;
snd_buf = my_rank;
rcv_buf = -1000; /* unused value, should be overwritten by first MPI_Recv; only for test purpose */
rcv_counts[0] = 1; MPI_Get_address(&rcv_buf, &rcv_displs[0]); snd_types[0] = MPI_INT;
rcv_counts[1] = 0; rcv_displs[1] = 0 /*unused*/; snd_types[1] = MPI_INT;
snd_counts[0] = 0; snd_displs[0] = 0 /*unused*/; rcv_types[0] = MPI_INT;
snd_counts[1] = 1; MPI_Get_address(&snd_buf, &snd_displs[1]); rcv_types[1] = MPI_INT;
for( i = 0; i < size; i++)
{
/* Substituted by MPI_Neighbor_alltoallw() :
MPI_Issend(&snd_buf, 1, MPI_INT, right, to_right,
new_comm, &request);
MPI_Recv(&rcv_buf, 1, MPI_INT, left, to_right,
new_comm, &status);
MPI_Wait(&request, &status);
*/
MPI_Neighbor_alltoallw(MPI_BOTTOM, snd_counts, snd_displs, snd_types,
MPI_BOTTOM, rcv_counts, rcv_displs, rcv_types, new_comm);
snd_buf = rcv_buf;
sum += rcv_buf;
}
printf ("PE%i:\tSum = %i\n", my_rank, sum);
MPI_Finalize();
}
My non-working Python code:
from mpi4py import MPI
size = MPI.COMM_WORLD.Get_size()
my_rank = MPI.COMM_WORLD.Get_rank()
to_right =201
max_dims=1
dims = [max_dims]
periods=[max_dims]
dims[0]=size
periods[0]=1
reorder = True
new_comm=MPI.Intracomm.Create_cart(MPI.COMM_WORLD,dims,periods,True)
my_rank= new_comm.Get_rank()
left_right= MPI.Cartcomm.Shift(new_comm,0,1)
left=left_right[0]
right=left_right[1]
sum=0
snd_buf=my_rank
rcv_buf=-1000 #unused value, should be overwritten, only for test purpose
for counter in range(0, size):
    MPI.Topocomm.Neighbor_alltoallw(new_comm, snd_buf, rcv_buf)
    snd_buf = rcv_buf
    sum = sum + rcv_buf
print('PE ', my_rank, 'sum=', sum)

Please explain cache coherence

I've recently learned about false sharing, which in my understanding stems from the CPU's attempt to create cache coherence between different cores.
However, doesn't the following example demonstrate that cache coherence is violated?
The example below launches several threads that increase a global variable x, several threads that assign the value of x to y, and an observer that tests whether y > x. The condition y > x should never happen if there were memory coherence between the cores, since y is only increased after x has been increased. However, this condition does happen according to the results of running this program. I tested it on Visual Studio, both x64 and x86, both Debug and Release, with pretty much the same results.
So, does memory coherence only happen when it's bad and never when it's good? :)
Please explain how cache coherence works and how it doesn't work. If you can guide me to a book that explains the subject I'll be grateful.
edit: I've added mfence wherever possible; still there is no memory coherence (presumably due to stale cache).
Also, I know the program has a data race; that's the whole point. My question is: why is there a data race if the CPU maintains cache coherence? (And if it weren't maintaining cache coherence, then what is false sharing and how does it happen?) Thank you.
#include <intrin.h>
#include <windows.h>
#include <iostream>
#include <thread>
#include <atomic>
#include <list>
#include <chrono>
#include <ratio>
#define N 1000000
#define SEPARATE_CACHE_LINES 0
#define USE_ATOMIC 0
#pragma pack(1)
struct
{
__declspec (align(64)) volatile long x;
#if SEPARATE_CACHE_LINES
__declspec (align(64))
#endif
volatile long y;
} data;
volatile long &g_x = data.x;
volatile long &g_y = data.y;
int g_observed;
std::atomic<bool> g_start;
void Observer()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
_mm_mfence();
long y = g_y;
_mm_mfence();
long x = g_x;
_mm_mfence();
if (y > x)
{
++g_observed;
}
}
}
void XIncreaser()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
#if USE_ATOMIC
InterlockedAdd(&g_x,1);
#else
_mm_mfence();
int x = g_x+1;
_mm_mfence();
g_x = x;
_mm_mfence();
#endif
}
}
void YAssigner()
{
while (!g_start);
for (int i = 0;i < N;++i)
{
#if USE_ATOMIC
long x = g_x;
InterlockedExchange(&g_y, x);
#else
_mm_mfence();
int x = g_x;
_mm_mfence();
g_y = x;
_mm_mfence();
#endif
}
}
int main()
{
using namespace std::chrono;
g_x = 0;
g_y = 0;
g_observed = 0;
g_start = false;
const int NAssigners = 4;
const int NIncreasers = 4;
std::list<std::thread> threads;
for (int i = 0;i < NAssigners;++i)
{
threads.emplace_back(YAssigner);
}
for (int i = 0;i < NIncreasers;++i)
{
threads.emplace_back(XIncreaser);
}
threads.emplace_back(Observer);
auto tic = high_resolution_clock::now();
g_start = true;
for (std::thread& t : threads)
{
t.join();
}
auto toc = high_resolution_clock::now();
std::cout << "x = " << g_x << " y = " << g_y << " number of times y > x = " << g_observed << std::endl;
std::cout << "&x = " << (int*)&g_x << " &y = " << (int*)&g_y << std::endl;
std::chrono::duration<double> t = toc - tic;
std::cout << "time elapsed = " << t.count() << std::endl;
std::cout << "USE_ATOMIC = " << USE_ATOMIC << " SEPARATE_CACHE_LINES = " << SEPARATE_CACHE_LINES << std::endl;
return 0;
}
Example output:
x = 1583672 y = 1583672 number of times y > x = 254
&x = 00007FF62BE95800 &y = 00007FF62BE95804
time elapsed = 0.187785
USE_ATOMIC = 0 SEPARATE_CACHE_LINES = 0
False sharing is mainly related to performance, not coherence or program order. The CPU cache works at a granularity which is typically 16, 32, 64, ... bytes. That means that if two independent data items are close together in memory, they experience each other's cache operations. Specifically, if (uintptr_t)&a / CACHE_LINE_SIZE == (uintptr_t)&b / CACHE_LINE_SIZE, then they share a cache line.
For example, if CPUs 0 and 1 are fighting over a, and CPUs 2 and 3 are fighting over b, the cache line containing a and b will thrash between each of the four caches. This is the effect of false sharing, and it causes a large performance drop.
False sharing happens because the coherence algorithm in the caches demands a consistent view of memory. A good way to examine it is to put two atomic counters in a structure, spaced out by a few kilobytes:
struct a {
    long a;
    long pad[1024];
    long b;
};
and find a nice little machine language function to do an atomic increment. Then cut loose NCPU/2 threads incrementing a and NCPU/2 threads incrementing b until they reach a big number.
Then repeat, commenting out the pad array. Compare the times.
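A minimal C++ sketch of that experiment (a hedged illustration using std::atomic instead of a hand-written assembly increment; the thread and iteration counts are arbitrary):
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

#define PADDED 1   // set to 0 to put both counters on the same cache line

struct Counters {
    std::atomic<long> a{0};
#if PADDED
    char pad[4096];            // push b onto a different cache line
#endif
    std::atomic<long> b{0};
};

static Counters counters;

static void bump(std::atomic<long>& c, long iters) {
    for (long i = 0; i < iters; ++i)
        c.fetch_add(1, std::memory_order_relaxed);   // atomic increment
}

int main() {
    const long iters = 10000000;
    const int half = 2;                               // "NCPU/2" threads per counter
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < half; ++i) threads.emplace_back(bump, std::ref(counters.a), iters);
    for (int i = 0; i < half; ++i) threads.emplace_back(bump, std::ref(counters.b), iters);
    for (auto& t : threads) t.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    std::printf("a=%ld b=%ld time=%.3f s (PADDED=%d)\n",
                counters.a.load(), counters.b.load(), dt.count(), PADDED);
    return 0;
}
Running it once with PADDED set to 1 and once with 0 shows the timing difference the answer describes, with no other code change.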
When you are trying to get at machine details, clarity and precision are your friends; C++ and weird attribute declarations aren’t.

Generate Checksum for String

I would like to generate a checksum for strings/data.
1. The same data should produce the same checksum.
2. Two different data strings can't produce the same checksum. A random collision rate of 0.1% is negligible.
3. No encryption/decryption of the data.
4. The checksum need not be very long and may contain letters and digits.
5. Must be very fast and efficient. Imagine that generating checksum(s) for 100 MB of text data should take less than 5 minutes, and generating 1000 checksums for segments of less than 1 KB each should take less than 10 seconds.
Any algorithm or implementation reference and suggestions are most appreciated.
You can write a custom hash function (C++):
#include <string>

long long hash(const std::string& s) {
    long long k = 7;
    for (size_t i = 0; i < s.length(); i++) {
        k *= 23;
        k += s[i];      // mix in the next character
        k *= 13;
        k %= 1000000009;
    }
    return k;
}
This should give you a reasonably good (collision-free for most inputs) hash value.
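A hypothetical usage, assuming the hash function above is in scope (the input string is arbitrary):
#include <iostream>
#include <string>

int main() {
    std::string data = "some text to checksum";
    // the same input always yields the same value; different inputs usually differ
    std::cout << hash(data) << "\n";
    return 0;
}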
A very common, fast checksum is the CRC-32, a 32-bit polynomial cyclic redundancy check. Here are three implementations in C, which vary in speed vs. complexity, of the CRC-32: (This is from http://www.hackersdelight.org/hdcodetxt/crc.c.txt)
#include <stdio.h>
#include <stdlib.h>
// ---------------------------- reverse --------------------------------
// Reverses (reflects) bits in a 32-bit word.
unsigned reverse(unsigned x) {
x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);
x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);
x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);
x = (x << 24) | ((x & 0xFF00) << 8) |
((x >> 8) & 0xFF00) | (x >> 24);
return x;
}
// ----------------------------- crc32a --------------------------------
/* This is the basic CRC algorithm with no optimizations. It follows the
logic circuit as closely as possible. */
unsigned int crc32a(unsigned char *message) {
int i, j;
unsigned int byte, crc;
i = 0;
crc = 0xFFFFFFFF;
while (message[i] != 0) {
byte = message[i]; // Get next byte.
byte = reverse(byte); // 32-bit reversal.
for (j = 0; j <= 7; j++) { // Do eight times.
if ((int)(crc ^ byte) < 0)
crc = (crc << 1) ^ 0x04C11DB7;
else crc = crc << 1;
byte = byte << 1; // Ready next msg bit.
}
i = i + 1;
}
return reverse(~crc);
}
// ----------------------------- crc32b --------------------------------
/* This is the basic CRC-32 calculation with some optimization but no
table lookup. The byte reversal is avoided by shifting the crc reg
right instead of left and by using a reversed 32-bit word to represent
the polynomial.
When compiled to Cyclops with GCC, this function executes in 8 + 72n
instructions, where n is the number of bytes in the input message. It
should be doable in 4 + 61n instructions.
If the inner loop is strung out (approx. 5*8 = 40 instructions),
it would take about 6 + 46n instructions. */
unsigned int crc32b(unsigned char *message) {
int i, j;
unsigned int byte, crc, mask;
i = 0;
crc = 0xFFFFFFFF;
while (message[i] != 0) {
byte = message[i]; // Get next byte.
crc = crc ^ byte;
for (j = 7; j >= 0; j--) { // Do eight times.
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
i = i + 1;
}
return ~crc;
}
// ----------------------------- crc32c --------------------------------
/* This is derived from crc32b but does table lookup. First the table
itself is calculated, if it has not yet been set up.
Not counting the table setup (which would probably be a separate
function), when compiled to Cyclops with GCC, this function executes in
7 + 13n instructions, where n is the number of bytes in the input
message. It should be doable in 4 + 9n instructions. In any case, two
of the 13 or 9 instructions are load byte.
This is Figure 14-7 in the text. */
unsigned int crc32c(unsigned char *message) {
int i, j;
unsigned int byte, crc, mask;
static unsigned int table[256];
/* Set up the table, if necessary. */
if (table[1] == 0) {
for (byte = 0; byte <= 255; byte++) {
crc = byte;
for (j = 7; j >= 0; j--) { // Do eight times.
mask = -(crc & 1);
crc = (crc >> 1) ^ (0xEDB88320 & mask);
}
table[byte] = crc;
}
}
/* Through with table setup, now calculate the CRC. */
i = 0;
crc = 0xFFFFFFFF;
while ((byte = message[i]) != 0) {
crc = (crc >> 8) ^ table[(crc ^ byte) & 0xFF];
i = i + 1;
}
return ~crc;
}
If you simply google "CRC32", you will get more info than you could possibly absorb.
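For reference, a small driver exercising the three functions above (a sketch to append to that listing; the well-known CRC-32 check value for the string "123456789" is 0xCBF43926, which all three functions should reproduce):
int main(void) {
    unsigned char msg[] = "123456789";
    printf("crc32a = %08X\n", crc32a(msg));   // expected: CBF43926
    printf("crc32b = %08X\n", crc32b(msg));
    printf("crc32c = %08X\n", crc32c(msg));
    return 0;
}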

ArrayFire frame search algorithm crash

I am new to ArrayFire and CUDA development in general, I just started using ArrayFire a couple of days ago after failing miserably using Thrust.
I am building an ArrayFire-based algorithm that is supposed to search a single 32x32 pixel frame in a database of a couple hundred thousand 32x32 frames that are stored into device memory.
At first I initialize a matrix that has 1024 + 1 pixels as rows (I need an extra one to keep a frame group id) and a predefined number (in this case 1000) of frames, indexed by column.
Here's the function that performs the search. If I uncomment "pixels_uint32 = device_frame_ptr[pixel_group_idx];", the program crashes. The pointer seems to be valid, so I do not understand why this happens. Maybe there is something I do not know about accessing device memory in this way?
#include <iostream>
#include <stdio.h>
#include <sys/types.h>
#include <arrayfire.h>
#include "utils.h"
using namespace af;
using namespace std;
/////////////////////////// CUDA settings ////////////////////////////////
#define TEST_DEBUG false
#define MAX_NUMBER_OF_FRAMES 1000 // maximum (2499999 frames) X (1024 + 1 pixels per frame) x (2 bytes per pixel) = 5.124.997.950 bytes (~ 5GB)
#define BLOB_FINGERPRINT_SIZE 1024 //32x32
//percentage of macroblocks that should match: 0.9 means 90%
#define MACROBLOCK_COMPARISON_OVERALL_THRESHOLD 768 //1024 * 0.75
//////////////////////// End of CUDA settings ////////////////////////////
array search_frame(array d_db_vec)
{
try {
uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;
// create one-element array to hold the result of the computation
array frame_found(1,MAX_NUMBER_OF_FRAMES, u32);
frame_found = 0;
gfor (array frame_idx, MAX_NUMBER_OF_FRAMES) {
// get the blob id it's the last coloumn of the matrix
array blob_id = d_db_vec(number_of_uint32_for_frame, frame_idx); // addressing with (pixel_idx, frame_idx)
// define some hardcoded pixel to search for
uint8_t searched_r = 0x0;
uint8_t searched_g = 0x3F;
uint8_t searched_b = 0x0;
uint8_t b1 = 0;
uint8_t g1 = 0;
uint8_t r1 = 0;
uint8_t b2 = 0;
uint8_t g2 = 0;
uint8_t r2 = 0;
uint32_t sum1 = 0;
uint32_t sum2 = 0;
uint32_t *device_frame_ptr = NULL;
uint32_t pixels_uint32 = 0;
uint pixel_match_counter = 0;
//uint pixel_match_counter = 0;
array frame = d_db_vec(span, frame_idx);
device_frame_ptr = frame.device<uint32_t>();
for (uint pixel_group_idx = 0; pixel_group_idx < number_of_uint32_for_frame; pixel_group_idx++) {
// test to see if the whole matrix is traversed
// d_db_vec(pixel_group_idx, frame_idx) = 0;
/////////////////////////////// PROBLEMATIC CODE ///////////////////////////////////
pixels_uint32 = 0x7E007E0;
//pixels_uint32 = device_frame_ptr[pixel_group_idx]; //why does this crash the program?
// if I uncomment the above line the program tries to copy the u32 frame into the pixels_uint32 variable
// something goes wrong, since the pointer device_frame_ptr is not NULL and the elements should be there judging by the lines above
////////////////////////////////////////////////////////////////////////////////////
// splitting the first pixel into its components
b1 = (pixels_uint32 & 0xF8000000) >> 27; //(input & 11111000000000000000000000000000)
g1 = (pixels_uint32 & 0x07E00000) >> 21; //(input & 00000111111000000000000000000000)
r1 = (pixels_uint32 & 0x001F0000) >> 16; //(input & 00000000000111110000000000000000)
// splitting the second pixel into its components
b2 = (pixels_uint32 & 0xF800) >> 11; //(input & 00000000000000001111100000000000)
g2 = (pixels_uint32 & 0x07E0) >> 5; //(input & 00000000000000000000011111100000)
r2 = (pixels_uint32 & 0x001F); //(input & 00000000000000000000000000011111)
// checking if they are a match
sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);
// if they match, increment the local counter
pixel_match_counter = (sum1 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
pixel_match_counter = (sum2 <= 16) ? pixel_match_counter + 1 : pixel_match_counter;
}
bool is_found = pixel_match_counter > MACROBLOCK_COMPARISON_OVERALL_THRESHOLD;
// write down if the frame is a match or not
frame_found(0,frame_idx) = is_found ? frame_found(0,frame_idx) : blob_id;
}
// test to see if the whole matrix is traversed - this has to print zeroes
if (TEST_DEBUG)
print(d_db_vec);
// return the matches array
return frame_found;
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
throw;
}
}
// make 2 green pixels
uint32_t make_test_pixel_group() {
uint32_t b1 = 0x0; //11111000000000000000000000000000
uint32_t g1 = 0x7E00000; //00000111111000000000000000000000
uint32_t r1 = 0x0; //00000000000111110000000000000000
uint32_t b2 = 0x0; //00000000000000001111100000000000
uint32_t g2 = 0x7E0; //00000000000000000000011111100000
uint32_t r2 = 0x0; //00000000000000000000000000011111
uint32_t green_pix = b1 | g1 | r1 | b2 | g2 | r2;
return green_pix;
}
int main(int argc, char ** argv)
{
info();
/////////////////////////////////////// CREATE THE DATABASE ///////////////////////////////////////
uint number_of_uint32_for_frame = BLOB_FINGERPRINT_SIZE / 2;
array d_db_vec(number_of_uint32_for_frame + 1, // fingerprint size + 1 extra u32 for blob id
MAX_NUMBER_OF_FRAMES, // number of frames
u32); // type of elements is 32-bit unsigned integer (unsigned) with the configuration RGBRGB (565565)
if (TEST_DEBUG == true) {
for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
for (uint pix_idx = 0; pix_idx < number_of_uint32_for_frame; pix_idx++) {
d_db_vec(pix_idx, frame_idx) = make_test_pixel_group(); // fill everything with green :D
}
}
} else {
d_db_vec = rand(number_of_uint32_for_frame + 1, MAX_NUMBER_OF_FRAMES);
}
cout << "Setting blob ids. \n\n";
for (uint frame_idx = 0; frame_idx < MAX_NUMBER_OF_FRAMES; frame_idx++) {
// set the blob id to 123456
d_db_vec(number_of_uint32_for_frame, frame_idx) = 123456; // blob_id = 123456
}
if (TEST_DEBUG)
print(d_db_vec);
cout << "Done setting blob ids. \n\n";
//////////////////////////////////// CREATE THE SEARCHED FRAME ///////////////////////////////////
// to be done, for now we use the hardcoded values at line 37-39 to simulate the searched pixel:
//37 uint8_t searched_r = 0x0;
//38 uint8_t searched_g = 0x3F;
//39 uint8_t searched_b = 0x0;
///////////////////////////////////////////// SEARCH /////////////////////////////////////////////
clock_t timer = startTimer();
for (int i = 0; i< 1000; i++) {
array frame_found = search_frame(d_db_vec);
if (TEST_DEBUG)
print(frame_found);
}
stopTimer(timer);
return 0;
}
Here is the console output with the line commented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Time: 0.03 seconds.
Here is the console output with the line uncommented:
arrayfire/examples/helloworld$ ./helloworld
ArrayFire v1.9.1 (64-bit Linux, build 9af23ea)
License: Server (27000#server.accelereyes.com)
CUDA toolkit 5.0, driver 304.54
GPU0 Tesla C2075, 5376 MB, Compute 2.0
Memory Usage: 5312 MB free (5376 MB total)
Setting blob ids.
Done setting blob ids.
Segmentation fault
Thanks in advance for any help on this issue. I really tried everything but without success.
Disclaimer: I am the lead developer of arrayfire. I see that you have posted on AccelerEyes forums as well, but I am posting here to clear up some common issues with your code.
Do not use .device(), .host(), or .scalar() inside a gfor loop. This will cause divergences inside the GFOR loop, and GFOR was not designed for this.
You cannot index into a device pointer. The pointer refers to a location on the GPU; when you do device_frame_ptr[pixel_group_idx];, the system is looking for the equivalent position on the CPU. This is the reason for your segmentation fault.
Use vectorized code. For example, you don't need the inner for loop inside the gfor. Instead of doing b1 = (pixels_uint32 & 0xF8000000) >> 27; in a for loop, you can do array B1 = (frame & 0xF8000000) >> 27;. That is, instead of bringing data back to the CPU and looping over it, you perform the entire operation on the GPU.
Don't use if-else or ternary operators inside GFOR. These cause divergences again. For example, use pixel_match_counter = sum(sum1 <= 16) + sum(sum2 <= 16); and frame_found(0, frame_idx) = is_found * frame_found(0, frame_idx) + (1 - is_found) * blob_id;.
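Putting those suggestions together, here is a rough, hedged sketch of the per-frame comparison written entirely with array operations (illustrative only; variable names are taken from the question, and the exact ArrayFire calls, indexing, and type handling may need adjusting for your ArrayFire version):
// Hedged sketch of the gfor body: no .device(), no host-side inner loop.
array frame = d_db_vec(seq(number_of_uint32_for_frame), frame_idx); // pixel words only

// unpack the two 565 pixels held in each 32-bit word, on the GPU
array b1 = (frame & 0xF8000000) >> 27;
array g1 = (frame & 0x07E00000) >> 21;
array r1 = (frame & 0x001F0000) >> 16;
array b2 = (frame & 0x0000F800) >> 11;
array g2 = (frame & 0x000007E0) >> 5;
array r2 =  frame & 0x0000001F;

// distance of every pixel to the searched colour
// (note: you may need signed types here to keep abs() meaningful)
array sum1 = abs(searched_r - r1) + abs(searched_g - g1) + abs(searched_b - b1);
array sum2 = abs(searched_r - r2) + abs(searched_g - g2) + abs(searched_b - b2);

// count matching pixels without any if/else or ternary operators
array pixel_match_counter = sum(sum1 <= 16) + sum(sum2 <= 16);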
I have answered the particular problem you are facing. If you have any follow up questions, please follow up on our forums and / or our support email. Stackoverflow is good for asking a specific question, but not to debug your entire program.

possible race condition in pthread application (unable to detect)

Here is my problem with a pthread code. When I run the following commands:
./run 1
./run 2
./run 4
the first two commands (one thread and two threads) generate the same output. However with 4 threads (third command), I see different outputs.
Now when I run the following commands
valgrind --tool=helgrind ./run 1
valgrind --tool=helgrind ./run 2
valgrind --tool=helgrind ./run 4
They generate the same outputs. The output values are correct though.
How can I investigate further?
The code looks like
int main(int argc,char *argv[])
{
// Barrier initialization
if(pthread_barrier_init(&barr, NULL, threads)) {
printf("Could not create a barrier\n");
return -1;
}
int t;
for(t = 0; t < threads; ++t) {
printf("In main: creating thread %ld\n", t);
if(pthread_create(&td[t], NULL, &foo, (void*)t)) {
printf("Could not create thread %d\n", t);
return -1;
}
}
...
}
void * foo(void *threadid)
{
long tid = (long)threadid;
for ( i = (tid*n/threads)+1; i <= (tid+1)*n/threads; i++ ) {
printf( "Thread %d, i=%d\n", tid, i );
for(largest = i, j = i+1; j <= n; j++) {
if(abs( a[j][i] ) > abs( a[largest][i] ))
largest = j;
}
for(k = i; k <= n+1; k++)
SWAP_DOUBLE( a[largest][k], a[i][k]);
for( j = i+1; j <= n; j++) {
for( k = n+1; k >= i; k--)
a[j][k] = a[j][k]-a[i][k]*a[j][i]/a[i][i];
}
}
int rc = pthread_barrier_wait(&barr);
if(rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
printf("Could not wait on barrier\n");
exit(-1);
}
printf("after barrier\n");
...
}
The main loop (which iterates over i in foo()) is divided among the threads. Assume all variables are defined properly; as I said, there is no problem with 1 and 2 threads.
I'm not entirely sure what's going on, since you haven't given a complete compilable program to experiment with, but it's clear that each of the threads is reading and writing sections of a that aren't assigned to it, so you have race conditions all over the place. You are swapping sections of a, so I'm not sure you can parallelize this algorithm as it stands.
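To make the overlap concrete, here is a small, self-contained sketch (not the asker's program; the array size and update are made up) showing the same pattern of threads writing rows outside their own block, which is exactly the kind of race helgrind reports:
#include <pthread.h>
#include <stdio.h>

#define N 8
#define THREADS 2

static double a[N][N];

/* Each thread is handed a contiguous block of i values, but the elimination-
   style update writes every row below i, including rows that belong to the
   other thread's block: a read/write race on those rows. */
static void *worker(void *arg)
{
    long tid = (long)arg;
    for (int i = tid * N / THREADS; i < (tid + 1) * N / THREADS; i++)
        for (int j = i + 1; j < N; j++)      /* j runs past this thread's block */
            for (int k = 0; k < N; k++)
                a[j][k] -= a[i][k];          /* may touch the other thread's rows */
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = i + j + 1.0;           /* arbitrary non-zero data */
    for (long tid = 0; tid < THREADS; tid++)
        pthread_create(&t[tid], NULL, worker, (void *)tid);
    for (int tid = 0; tid < THREADS; tid++)
        pthread_join(t[tid], NULL);
    printf("a[N-1][N-1] = %f\n", a[N - 1][N - 1]);  /* value depends on interleaving */
    return 0;
}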
