const int SIZE = 20;
struct Node { Node* next; };
std::atomic<Node*> head (nullptr);
void push (void* p)
{
Node* n = (Node*) p;
n->next = head.load ();
while (!head.compare_exchange_weak (n->next, n));
}
void* pop ()
{
Node* n = head.load ();
while (n &&
!head.compare_exchange_weak (n, n->next));
return n ? n : malloc (SIZE);
}
void thread_fn()
{
std::array<char*, 1000> pointers;
for (int i = 0; i < 1000; i++) pointers[i] = nullptr;
for (int i = 0; i < 10000000; i++)
{
int r = random() % 1000;
if (pointers[r] != nullptr) // allocated earlier
{
push (pointers[r]);
pointers[r] = nullptr;
}
else
{
pointers[r] = (char*) pop (); // allocate
// stamp the memory
for (int i = 0; i < SIZE; i++)
pointers[r][i] = 0xEF;
}
}
}
int main(int argc, char *argv[])
{
int N = 8;
std::vector<std::thread*> threads;
threads.reserve (N);
for (int i = 0; i < N; i++)
threads.push_back (new std::thread (thread_fn));
for (int i = 0; i < N; i++)
threads[i]->join();
}
What is wrong with this usage of compare_exchange_weak? The code above crashes about 1 time in 5 when built with clang++ (Mac OS X).
At the time of the crash, head.load() returns 0xEFEFEFEFEF. pop is like malloc and push is like free. Each of the 8 threads randomly allocates memory from, or returns memory to, the list at head.
It could be a nice lock-free allocator, but it suffers from the ABA problem:
A: Assume thread1 executes pop() and reads the current value of head into n, but is preempted immediately afterwards. A concurrent thread2 then executes a full pop() call: it reads the same value from head and performs a successful compare_exchange_weak.
B: The object that n refers to in thread1 no longer belongs to the list and can be modified by thread2. So n->next is garbage in general: reading it can return any value. For example, it can be 0xEFEFEFEFEF, where the first 5 bytes are the stamp (EF) written by thread2 and the last 3 bytes are still 0 from nullptr. (The bytes are interpreted as a little-endian number.) Because head has changed, it seems that thread1's compare_exchange_weak call should fail, but...
A: thread2 push()es that pointer back onto the list. thread1 now sees the original value of head and performs a successful compare_exchange_weak, which writes the incorrect value into head. The list is corrupted.
Note that the problem is more than the possibility that another thread modifies the contents of n->next. The problem is that the value of n->next is no longer coupled to the list. So even if it is not modified concurrently, it becomes invalid as a replacement for head when, for example, other threads pop() two elements from the list but push() back only the first of them. (n->next then points to the second element, which no longer belongs to the list.)
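One common defence against ABA, which the explanation above does not spell out, is to pair the head with a version counter so that a stale compare-exchange fails even when the same node is back on top. The following is only a minimal sketch under assumptions that are not in the original code: slots are addressed by 32-bit indices instead of raw pointers, the links live in their own array (so stamping a payload can never overwrite a link), and the names TaggedFreeList, kNil, pack, index_of and tag_of are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a free list of slot indices whose head carries a
// version tag in the upper 32 bits. Every successful push or pop bumps the
// tag, so a compare-exchange that still holds an old head value fails.
class TaggedFreeList {
public:
    static constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "empty list" marker

    explicit TaggedFreeList(std::uint32_t slots)
        : next_(slots), head_(pack(kNil, 0)) {
        for (std::uint32_t i = 0; i < slots; ++i) push(i);
    }

    // Return slot idx to the list (plays the role of push/free above).
    void push(std::uint32_t idx) {
        assert(idx < next_.size());
        std::uint64_t old_head = head_.load();
        for (;;) {
            next_[idx].store(index_of(old_head));
            // On failure old_head is refreshed and the link is rewritten.
            if (head_.compare_exchange_weak(old_head,
                                            pack(idx, tag_of(old_head) + 1)))
                return;
        }
    }

    // Take a slot from the list (plays the role of pop/malloc above).
    // Returns kNil when the list is empty.
    std::uint32_t pop() {
        std::uint64_t old_head = head_.load();
        for (;;) {
            const std::uint32_t idx = index_of(old_head);
            if (idx == kNil) return kNil;
            // Reading a link of a slot that was popped meanwhile is harmless:
            // the array is always valid memory, and the tag check below
            // rejects the stale value.
            const std::uint32_t next = next_[idx].load();
            if (head_.compare_exchange_weak(old_head,
                                            pack(next, tag_of(old_head) + 1)))
                return idx;
        }
    }

private:
    static std::uint64_t pack(std::uint32_t index, std::uint32_t tag) {
        return (static_cast<std::uint64_t>(tag) << 32) | index;
    }
    static std::uint32_t index_of(std::uint64_t v) {
        return static_cast<std::uint32_t>(v);
    }
    static std::uint32_t tag_of(std::uint64_t v) {
        return static_cast<std::uint32_t>(v >> 32);
    }

    std::vector<std::atomic<std::uint32_t>> next_;  // link per slot
    std::atomic<std::uint64_t> head_;               // (tag << 32) | index
};
```

The payload for slot i would live in a separate array indexed by i. The tag can in principle wrap around after 2^32 updates, so this narrows the ABA window rather than proving it away.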
I want to calculate the determinant of a matrix in a thread, but I get the error "term does not evaluate to a function taking 0 arguments". I want to solve a big matrix by splitting it up across threads. What can I do?
int determinant(int f[1000][1000], int x)
{
int pr, c[1000], d = 0, b[1000][1000], j, p, q, t;
if (x == 2)
{
d = 0;
d = (f[1][1] * f[2][2]) - (f[1][2] * f[2][1]);
return(d);
}
else
{
for (j = 1; j <= x; j++)
{
int r = 1, s = 1;
for (p = 1; p <= x; p++)
{
for (q = 1; q <= x; q++)
{
if (p != 1 && q != j)
{
b[r][s] = f[p][q];
s++;
if (s > x - 1)
{
r++;
s = 1;
}
}
}
}
for (t = 1, pr = 1; t <= (1 + j); t++)
pr = (-1)*pr;
c[j] = pr*determinant(b, x - 1);
}
for (j = 1, d = 0; j <= x; j++)
{
d = d + (f[1][j] * c[j]);
}
return(d);
}
}
int main()
{
srand(time_t(NULL));
int i, j;
printf("\n\nEnter order of matrix : ");
scanf_s("%d", &m);
printf("\nEnter the elements of matrix\n");
for (i = 1; i <= m; i++)
{
for (j = 1; j <= m; j++)
{
a[i][j] = rand() % 10;
}
}
thread t(determinant(a, m));
t.join();
printf("\n Determinant of Matrix A is %d .", determinant(a, m));
}
The immediate problem is that here: thread t(determinant(a, m)); you pass the result of calling determinant(a, m) as the function to execute, and zero arguments to call that function with - but an int is not a function or other callable object, which is what the error you got complains about.
std::thread's constructor takes the function to run and the arguments to supply separately, so you would need to call std::thread(determinant, a, m).
Now we have another problem, std::thread doesn't provide a way to retrieve the return value, and so you calculate it again here: printf("\n Determinant of Matrix A is %d .", determinant(a, m));.
To fix this, we can use std::async from the <future> header, which will manage the thread handling for us, and lets us retrieve the result later:
auto result = std::async(std::launch::async, determinant, a, m);
int det = result.get();
This will run determinant(a,m) on a new thread, and return a std::future<int> into which the return value may eventually be placed.
We can then try to retrieve that value with std::future::get(), which will block until the value can be retrieved (or until an exception occurs in the thread).
In this example, we still execute determinant in a pretty serial fashion, since we delegate the work to a thread, then wait for that thread to finish its work before continuing.
However we are now free to store the future, and defer calling std::future::get() until we actually need the value, potentially much later in your program.
There are a few other problems in the rest of your code:
all your array indexing is off by one (array indices run from 0 to N-1 in C and C++)
a few of the variables you're using don't exist (like a and m)
C-arrays are passed by pointer, so if you ever change the code not to block on the thread right there, the array will go out of scope and your thread may read garbage from the dangling pointer. If you use a proper container like std::array or std::vector, you can pass it by value so your thread will own the data to operate on for its entire lifetime.
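Putting those points together, here is a minimal sketch of how the pieces could fit: 0-based indexing, a std::vector matrix that the task owns by value, and std::async to get the result back. The names Matrix and determinant_async are assumptions made for illustration, and the cofactor expansion is kept only because the question uses it; it is far too slow for large matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<long long>>;  // hypothetical alias

// Cofactor expansion along row 0, with 0-based indices throughout.
long long determinant(const Matrix& m) {
    const std::size_t n = m.size();
    for (const auto& row : m) assert(row.size() == n);  // must be square
    if (n == 0) return 1;
    if (n == 1) return m[0][0];

    long long det = 0;
    long long sign = 1;
    for (std::size_t col = 0; col < n; ++col) {
        // Build the minor that drops row 0 and column col.
        Matrix minor;
        minor.reserve(n - 1);
        for (std::size_t r = 1; r < n; ++r) {
            std::vector<long long> row;
            row.reserve(n - 1);
            for (std::size_t c = 0; c < n; ++c)
                if (c != col) row.push_back(m[r][c]);
            minor.push_back(std::move(row));
        }
        det += sign * m[0][col] * determinant(minor);
        sign = -sign;
    }
    return det;
}

// The matrix is taken by value and moved into the task, so the task owns
// its data even if the caller's copy goes out of scope first.
std::future<long long> determinant_async(Matrix m) {
    return std::async(std::launch::async,
                      [m = std::move(m)] { return determinant(m); });
}
```

A caller would keep the returned future and call get() only where the value is needed, for example `auto f = determinant_async(a); /* other work */ long long d = f.get();`.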
I came across the persistent thread (PT) style of implementation for non-homogeneous work distribution and wrote a simple kernel to compare its computation time with a kernel doing the same computations the usual way. But my test implementation is about 6 times slower than the ordinary implementation, even without the overhead of sorting the buffer to group corresponding operations in batches of 32. Is this a reasonable slowdown or am I overlooking something? I launched the PT kernel with global_work_size = local_work_size = CL_DEVICE_MAX_WORK_GROUP_SIZE, which is 512. If I choose less, then obviously it gets even slower.
This is the ordinary kernel:
__kernel void myKernel(const __global int* buffer)
{
int myIndex = get_local_id(0);
doSomeComputations(buffer[myIndex]); //just many adds and mults, no conditionals
}
And this is the PT style kernel:
__constant int finalIndex = 655360;
__kernel void myKernel(const __global int* buffer)
{
__local volatile int nextIndex;
if (get_local_id(0) == 0)
nextIndex = 0;
mem_fence(CLK_LOCAL_MEM_FENCE);
int myIndex;
while(true){
// get next index
myIndex = nextIndex + get_local_id(0);
if (myIndex > finalIndex)
return;
if ( get_local_id(0) == 0)
nextIndex += 512;
mem_fence(CLK_LOCAL_MEM_FENCE);
doSomeComputations(buffer[myIndex]); //same computations as above
}
}
I thought both implementations should take about the same time. Why is the PT style implementation so much slower? Thank you in advance.
------------Edited below this line-------------
So just to be clear. This kernel launched with global_work_size=655360 and local_work_size=512
__kernel void myKernel()
{
int myIndex = get_local_id(0);
volatile float result;
float4 test = float4(1.1f);
for(int i=0; i<1000; i++)
test = (test*test + test*test)/2.0;
result = test.x;
}
runs 6 times faster than this kernel launched with global_work_size=512 and local_work_size=512
__kernel void myKernel()
{
for(size_t idx = 0; idx < 655360; idx += get_local_size(0))
{
volatile float result;
float4 test = float4(1.1f);
for(int i=0; i<1000; i++)
test = (test*test + test*test)/2.0;
result = test.x;
}
}
You can reduce your second kernel to just this:
__kernel void myKernel(const __global int* buffer)
{
for(int x = 0; x < 655360; x += get_local_size(0))
doSomeComputations(buffer[x+get_local_id(0)]);
}
Update: summary of the discussion in the comments
The first kernel (global_work_size=655360 and local_work_size=512) is split into 655360/512 = 1280 work groups, which can be spread over every compute unit and fully utilize the GPU. The second kernel (global_work_size=512 and local_work_size=512) is a single work group, so it occupies just one compute unit, which explains why the first one runs faster.
More details about persistent threads in GPU: persistent-threads-in-opencl-and-cuda.
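The arithmetic behind that summary can be written down as a small host-side sketch. This is plain C++, not OpenCL; the function names and the compute-unit count of 16 used in the usage note are assumptions for illustration, and the result is only an upper bound on how many compute units can be kept busy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Number of work groups a launch is split into.
std::size_t work_group_count(std::size_t global_work_size,
                             std::size_t local_work_size) {
    // Assumes the global size is a multiple of the local size, as in both
    // launches discussed above.
    assert(local_work_size > 0 && global_work_size % local_work_size == 0);
    return global_work_size / local_work_size;
}

// A work group runs on a single compute unit, so at most this many
// compute units can have work at the same time.
std::size_t busy_compute_units(std::size_t groups, std::size_t compute_units) {
    return std::min(groups, compute_units);
}
```

For a hypothetical device with 16 compute units, the first launch gives 1280 groups and can occupy all 16 units, while the second gives 1 group and can occupy only 1. A persistent-thread launch would normally use enough work groups to cover every compute unit rather than a single one.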
My MPI code deadlocks when I run this simple code on 512 processes on a cluster. I am far from the memory limit. If I increase the number of processes to 2048, which is far too many for this problem, the code runs again. The deadlock occurs on the line containing MPI_File_write_all.
Any suggestions?
int count = imax*jmax*kmax;
// CREATE THE SUBARRAY
MPI_Datatype subarray;
int totsize [3] = {kmax, jtot, itot};
int subsize [3] = {kmax, jmax, imax};
int substart[3] = {0, mpicoordy*jmax, mpicoordx*imax};
MPI_Type_create_subarray(3, totsize, subsize, substart, MPI_ORDER_C, MPI_DOUBLE, &subarray);
MPI_Type_commit(&subarray);
// SET THE VALUE OF THE GRID EQUAL TO THE PROCESS ID FOR CHECKING
if(mpiid == 0) std::printf("Setting the value of the array\n");
for(int i=0; i<count; i++)
u[i] = (double)mpiid;
// WRITE THE FULL GRID USING MPI-IO
if(mpiid == 0) std::printf("Write the full array to disk\n");
char filename[] = "u.dump";
MPI_File fh;
if(MPI_File_open(commxy, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL, MPI_INFO_NULL, &fh))
return 1;
// select noncontiguous part of 3d array to store the selected data
MPI_Offset fileoff = 0; // the offset within the file (header size)
char name[] = "native";
if(MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL))
return 1;
if(MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE))
return 1;
if(MPI_File_close(&fh))
return 1;
Your code looks right upon quick inspection. I would suggest that you let your MPI-IO library help tell you what's wrong: instead of returning from error, why don't you at least display the error? Here's some code that might help:
static void handle_error(int errcode, const char *str)
{
char msg[MPI_MAX_ERROR_STRING];
int resultlen;
MPI_Error_string(errcode, msg, &resultlen);
fprintf(stderr, "%s: %s\n", str, msg);
MPI_Abort(MPI_COMM_WORLD, 1);
}
Is MPI_SUCCESS guaranteed to be 0? I'd rather see
errcode = MPI_File_routine();
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open(1)");
Put that in and if you are doing something tricky like setting a file view with offsets that are not monotonically non-decreasing, the error string might suggest what's wrong.
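Alongside printing the error string, it can help to sanity-check the subarray parameters on every rank before they reach MPI_Type_create_subarray. The sketch below is plain C++ with no MPI calls; subarray_fits is a hypothetical helper name, and the check only encodes the documented requirement that each subsize lies between 1 and the full size and each start lies between 0 and size minus subsize. It is not a diagnosis of the deadlock in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when, in every dimension d,
//   1 <= subsizes[d] <= sizes[d]  and  0 <= starts[d] <= sizes[d] - subsizes[d].
bool subarray_fits(const std::vector<int>& sizes,
                   const std::vector<int>& subsizes,
                   const std::vector<int>& starts) {
    assert(sizes.size() == subsizes.size() && sizes.size() == starts.size());
    for (std::size_t d = 0; d < sizes.size(); ++d) {
        if (subsizes[d] < 1 || subsizes[d] > sizes[d]) return false;
        if (starts[d] < 0 || starts[d] > sizes[d] - subsizes[d]) return false;
    }
    return true;
}
```

With the variables from the question, a rank could call `subarray_fits({kmax, jtot, itot}, {kmax, jmax, imax}, {0, mpicoordy*jmax, mpicoordx*imax})` and report its rank and coordinates when the result is false.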
Here is my problem with a pthread code. When I run the following commands:
./run 1
./run 2
./run 4
the first two commands (one thread and two threads) generate the same output. However with 4 threads (third command), I see different outputs.
Now when I run the following commands
valgrind --tool=helgrind ./run 1
valgrind --tool=helgrind ./run 2
valgrind --tool=helgrind ./run 4
They generate the same outputs. The output values are correct though.
How can I investigate further?
The code looks like
int main(int argc,char *argv[])
{
// Barrier initialization
if(pthread_barrier_init(&barr, NULL, threads)) {
printf("Could not create a barrier\n");
return -1;
}
int t;
for(t = 0; t < threads; ++t) {
printf("In main: creating thread %d\n", t);
if(pthread_create(&td[t], NULL, &foo, (void*)t)) {
printf("Could not create thread %d\n", t);
return -1;
}
}
...
}
void * foo(void *threadid)
{
long tid = (long)threadid;
for ( i = (tid*n/threads)+1; i <= (tid+1)*n/threads; i++ ) {
printf( "Thread %ld, i=%d\n", tid, i );
for(largest = i, j = i+1; j <= n; j++) {
if(abs( a[j][i] ) > abs( a[largest][i] ))
largest = j;
}
for(k = i; k <= n+1; k++)
SWAP_DOUBLE( a[largest][k], a[i][k]);
for( j = i+1; j <= n; j++) {
for( k = n+1; k >= i; k--)
a[j][k] = a[j][k]-a[i][k]*a[j][i]/a[i][i];
}
}
int rc = pthread_barrier_wait(&barr);
if(rc != 0 && rc != PTHREAD_BARRIER_SERIAL_THREAD) {
printf("Could not wait on barrier\n");
exit(-1);
}
printf("after barrier\n");
...
}
The main loop (which iterates over i in foo()) is divided among the threads. Assume all variables are defined properly since, as I said, there is no problem with 1 and 2 threads.
I'm not entirely sure what's going on, since you haven't given a complete compilable program to experiment with, but it's clear that each of the threads is reading from and writing to sections of a that aren't assigned to it, so you have race conditions all over the place. You are also swapping sections of a, so I'm not sure you can parallelize this algorithm as it stands.
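One restructuring that avoids those races, offered only as a sketch and not as what the asker's program must become, is to keep the loop over pivots serial and split just the row updates for the current pivot among threads, joining them before the next pivot. The code below uses std::thread rather than pthreads, 0-based indices, and an n x (n+1) augmented matrix; the names Matrix and eliminate are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // hypothetical alias

// Forward elimination with partial pivoting on an n x (n+1) matrix.
void eliminate(Matrix& a, unsigned num_threads) {
    const std::size_t n = a.size();
    for (const auto& row : a) assert(row.size() == n + 1);
    if (num_threads == 0) num_threads = 1;

    for (std::size_t i = 0; i < n; ++i) {
        // Pivot search and row swap stay serial: they touch shared rows.
        std::size_t largest = i;
        for (std::size_t j = i + 1; j < n; ++j)
            if (std::fabs(a[j][i]) > std::fabs(a[largest][i])) largest = j;
        std::swap(a[largest], a[i]);
        if (a[i][i] == 0.0) continue;  // nothing to eliminate in this column

        // For a fixed pivot, each row below it is updated independently:
        // workers only read row i and only write their own rows.
        auto update_rows = [&a, i, n](std::size_t begin, std::size_t end) {
            for (std::size_t j = begin; j < end; ++j) {
                const double factor = a[j][i] / a[i][i];
                for (std::size_t k = i; k <= n; ++k)
                    a[j][k] -= factor * a[i][k];
            }
        };

        const std::size_t rows = n - (i + 1);
        const std::size_t chunk = (rows + num_threads - 1) / num_threads;
        std::vector<std::thread> workers;
        for (std::size_t begin = i + 1; begin < n; begin += chunk)
            workers.emplace_back(update_rows, begin,
                                 std::min(begin + chunk, n));
        // Joining here acts as the barrier between pivot steps.
        for (auto& w : workers) w.join();
    }
}
```

Creating threads for every pivot is wasteful; a real implementation would more likely keep a fixed set of threads and use a barrier per pivot step, but the data dependencies are the same.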
I am new to C++ programming. I have to call a function with the following arguments.
int Start (int argc, char **argv).
When I try to call the above function with the code below I get run-time exceptions. Can someone help me resolve this problem?
char * filename=NULL;
char **Argument1=NULL;
int Argument=0;
int j = 0;
int k = 0;
int i=0;
int Arg()
{
filename = "Globuss -dc bird.jpg\0";
for(i=0;filename[i]!=NULL;i++)
{
if ((const char *)filename[i]!=" ")
{
Argument1[j][k++] = NULL; // Here I get An unhandled
// exception of type
//'System.NullReferenceException'
// occurred
j++;
k=0;
}
else
{
(const char )Argument1[j][k] = filename [j]; // Here I also i get exception
k++;
Argument++;
}
}
Argument ++;
return 0;
}
Start (Argument,Argument1);
Two things:
char **Argument1=NULL;
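This is a pointer to a pointer. You need to allocate memory for it before you use it:
Argument1 = new char*[10]; // room for 10 argument pointers
for (i = 0; i < 10; ++i) Argument1[i] = new char[32]; // 32 is an arbitrary example length
Don't forget to release it in the same style: delete[] each Argument1[i], then delete[] Argument1 itself.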
You appear to have not allocated any memory for your arrays; you just have a NULL pointer.
char * filename=NULL;
char **Argument1=NULL;
int Argument=0;
int j = 0;
int k = 0;
int i=0;
int Arg()
{
filename = "Globuss -dc bird.jpg\0";
// I don't know why you have a 2D array here, but you need to allocate
// sizes for both parts of the 2D array
Argument1 = new char *[TotalFileNames];
for(int x = 0; x < TotalFileNames; x++)
Argument1[x] = new char[SIZE_OF_WHAT_YOU_NEED];
for(i=0;filename[i]!=NULL;i++)
{
if ((const char *)filename[i]!=" ")
{
Argument1[j][k++] = NULL; // Here I get An unhandled
// exception of type
//'System.NullReferenceException'
// occurred
j++;
k=0;
}
else
{
(const char )Argument1[j][k] = filename [j]; // Here I also i get exception
k++;
Argument++;
}
}
Argument ++;
return 0;
}
The first thing you have to do is find the number of strings you will have. That is easily done with something like:
int len = strlen(filename);
int numwords = 1;
for(i = 0; i < len; i++) {
if(filename[i] == ' ') {
numwords++;
// eating up all spaces to not count following ' '
// dont checking if i exceeds len, because it will auto-stop at '\0'
while(filename[i] == ' ') i++;
}
}
In the above code I assume there will be at least one word in filename (i.e. it won't be an empty string).
Now you can allocate memory for Argument1.
Argument1 = new char *[numwords];
After that you have two options:
use strtok (http://www.cplusplus.com/reference/clibrary/cstring/strtok/)
implement your function to split a string
That can be done like this:
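int i, cur, last;
// cur runs up to and including len so the terminating '\0' ends the last word
for(i = last = cur = 0; cur <= len; cur++) {
    // a space or the terminating '\0' ends the current word
    if(filename[cur] == ' ' || filename[cur] == '\0') {
        if(last < cur) { // skip empty words caused by repeated spaces
            Argument1[i] = new char[cur-last+1]; // +1 for string termination '\0'
            strncpy(Argument1[i], &filename[last], cur-last);
            Argument1[i][cur-last] = '\0'; // strncpy does not add it here
            i++; // move on to the next slot in Argument1
        }
        last = cur + 1; // the next word starts after this separator
    }
}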
The above code is not optimized; I just tried to make it as easy as possible to understand.
I also did not test it, but it should work. Assumptions I made:
the string is null terminated
there is at least 1 word in the string.
Also, whenever I'm referring to a string, I mean a char array :P
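If the standard library is an option, the manual allocation can be avoided altogether. The following is a minimal sketch of that alternative, not the approach described above: split_words, make_argv and run_with_args are hypothetical helper names, and the Start function from the question is assumed to have the signature int Start(int, char**).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace; repeated spaces produce no empty words.
std::vector<std::string> split_words(const std::string& command_line) {
    std::istringstream in(command_line);
    std::vector<std::string> words;
    for (std::string word; in >> word;) words.push_back(word);
    return words;
}

// Build an argv-style array that points into the strings in words.
// The pointers stay valid only while words is alive and unmodified.
std::vector<char*> make_argv(std::vector<std::string>& words) {
    std::vector<char*> argv;
    for (auto& word : words) argv.push_back(&word[0]);
    argv.push_back(nullptr);  // conventional argv terminator
    return argv;
}

// Call an argc/argv style function with the words of command_line.
int run_with_args(int (*start)(int, char**), const std::string& command_line) {
    assert(start != nullptr);
    std::vector<std::string> words = split_words(command_line);
    std::vector<char*> argv = make_argv(words);
    return start(static_cast<int>(words.size()), argv.data());
}
```

With the function from the question this would be called as `run_with_args(Start, "Globuss -dc bird.jpg")`, and nothing has to be deleted afterwards because the vectors release their memory themselves.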
Some mistakes I noticed in your code:
In C/C++, " " is a pointer to a const char array that contains a space.
If you compare it with another " " you compare the pointers, not the contents. They may (and probably will) be different. Use strcmp (http://www.cplusplus.com/reference/clibrary/cstring/strcmp/) to compare strings, or compare a single character such as filename[i] against the character literal ' '.
You should learn how to allocate memory dynamically. In C you can do it with malloc, in C++ with malloc or new (prefer new over malloc).
Hope I helped!
PS: if there is an error in my code, tell me and I'll fix it.