Error while reading same mem positions on different threads - multithreading

I have a problem reading a couple of positions in a double array from different threads.
I enqueue the execution with:
nelements = nx*ny;
err = clEnqueueNDRangeKernel(queue,kernelTvl2of,1,NULL,&nelements,NULL,0,NULL,NULL);
kernelTvl2of contains (among other things) this code:
size_t k = get_global_id(0);
(...)
u1_[k] = (float)u1[k];
(...)
barrier(CLK_GLOBAL_MEM_FENCE);
forwardgradient(u1_,u1x,u1y,k,nx,ny);
barrier(CLK_GLOBAL_MEM_FENCE);
and forwardgradient has the code:
void forwardgradient(global double *f, global double *fx, global double *fy, int ker,int nx, int ny){
unsigned int rowsnotlast = ((nx)*(ny-1));
if(ker<rowsnotlast){
fx[ker] = f[ker+1] - f[ker];
fy[ker] = f[ker+nx] - f[ker];
}
if(ker<nx*ny){
fx[ker] = f[ker+1] - f[ker];
if(ker==4607){
fx[0] = f[4607];
fx[1] = f[4608];
fx[2] = f[4608] - f[4607];
fx[3] = f[ker];
fx[4] = f[ker+1];
fx[5] = f[ker+1] - f[ker];
}
}
if(ker==(nx*ny)-1){
fx[ker] = 0;
fy[ker] = 0;
}
if(ker%nx == nx-1){
fx[ker]=0;
}
fx[6] = f[4608];
}
When I get the contents of the first positions of fx, they are:
-6 0 6 -6 0 6 -6
And here's my problem: when I read f[ker+1] or f[4608] in the thread with id 4607 I get a '0' (second and fifth positions of the output array), but from other threads I get a '-6' (last position of the output array).
Does anyone have a clue what I'm doing wrong, or where I could look?
Thanks a lot,
Anton

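Within a kernel, global memory consistency is only guaranteed within a single work-group. If a work-item writes a value to global memory, barrier(CLK_GLOBAL_MEM_FENCE) only guarantees that other work-items in the same work-group can read the updated value. You pass NULL as the local size, so the runtime picks the work-group size, and the work-item that writes u1_[4608] may well be in a different work-group from work-item 4607.
If you need global memory consistency across work-groups, split the kernel into several kernels enqueued one after another, so that each kernel only reads values written by a kernel that has already finished.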

Concurrency in Linux: Alternate access of two threads in critical zone

I'm new to the concepts of concurrency and threads in Linux and I tried to solve a relatively simple problem. I create two threads that run the same function, which increments a global variable. What I want is for the two threads to increment that variable alternately: at each step the thread that increments the variable prints a message to the screen, so the expected output should look like:
Thread 1 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 1 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
and so on.
I tried an implementation with a semaphore which ensures mutual exclusion, but nonetheless the result resembles this:
Thread 2 is incrementing variable count... count = 1
Thread 2 is incrementing variable count... count = 2
Thread 2 is incrementing variable count... count = 3
Thread 2 is incrementing variable count... count = 4
Thread 2 is incrementing variable count... count = 5
Thread 1 is incrementing variable count... count = 6
Thread 1 is incrementing variable count... count = 7
Thread 1 is incrementing variable count... count = 8
Thread 1 is incrementing variable count... count = 9
Thread 1 is incrementing variable count... count = 10
My code looks like:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <semaphore.h>
int count = 0;
sem_t mutex;
void function1(void *arg)
{
int i = 0;
int *a = (int*) arg;
while (i < 10)
{
sem_wait(&mutex);
count++;
i++;
printf("From the function : %d count is %d\n",*a,count);
sem_post(&mutex);
}
}
int main()
{
pthread_t t1,t2;
int a = 1;
int b = 2;
pthread_create(&t1,NULL,(void *)function1,&a);
pthread_create(&t2,NULL,(void *)function1,&b);
sem_init(&mutex,0,1);
pthread_join(t2,NULL);
pthread_join(t1,NULL);
sem_destroy(&mutex);
return 0;
}
My big question now is: how do I achieve this alternation between the threads? I have mutual exclusion, but the alternation is still missing. Maybe I lack good insight into how semaphores are used, and I would be very grateful if someone could explain it to me. (I have read several courses on this topic, namely Linux semaphores, concurrency and threads, but the information there wasn't satisfactory.)
Mutex locks don't guarantee any fairness. If thread 1 releases the lock and then immediately tries to take it again, nothing guarantees that thread 2 gets it first; the lock only guarantees that no two threads run the protected block at the same time.
EDIT: Removed previous C-style solution because it was probably incorrect. The question asks for a synchronization solution.
If you really want to do this correctly you would use a monitor with a guard (or a condition variable). I'm not too familiar with C and pthreads, so you would need to look up how to do it there, but in Java (using the Monitor class from Guava) it would look something like:
public class IncrementableInteger {
public int value;
@Override
public String toString() {
return String.valueOf(value);
}
}
@Test
public void testThreadAlternating() throws InterruptedException {
IncrementableInteger i = new IncrementableInteger();
Monitor m = new Monitor();
Monitor.Guard modIsZeroGuard = new Monitor.Guard(m) {
@Override public boolean isSatisfied() {
return i.value % 2 == 0;
}
};
Monitor.Guard modIsOneGuard = new Monitor.Guard(m) {
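// Also satisfied once the limit is reached; otherwise thread two would wait forever after the value hits 10.
@Override public boolean isSatisfied() {
return i.value % 2 == 1 || i.value >= 10;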
}
};
Thread one = new Thread(() -> {
while (true) {
m.enterWhenUninterruptibly(modIsZeroGuard);
try {
if (i.value >= 10) return;
i.value++;
System.out.println("Thread 1 inc: " + String.valueOf(i));
} finally {
m.leave();
}
}
});
Thread two = new Thread(() -> {
while (true) {
m.enterWhenUninterruptibly(modIsOneGuard);
try {
if (i.value >= 10) return;
i.value++;
System.out.println("Thread 2 inc: " + String.valueOf(i));
} finally {
m.leave();
}
}
});
one.start();
two.start();
one.join();
two.join();
}
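For readers who want the same idea without Guava, here is a minimal sketch using a mutex and a condition variable from the C++ standard library. It is not the original poster's code: the function name alternate_increments and the turn variable are made up for illustration. In C with pthreads the corresponding primitives would be pthread_mutex_t and pthread_cond_t.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): two threads take turns
// incrementing a shared counter until it reaches `limit`.
// Returns the id (1 or 2) of the thread that performed each increment.
std::vector<int> alternate_increments(int limit)
{
    assert(limit >= 0);

    std::mutex m;
    std::condition_variable cv;
    int count = 0;            // shared counter, protected by m
    int turn = 1;             // id of the thread allowed to increment next
    std::vector<int> order;   // who incremented at each step, protected by m

    auto worker = [&](int id) {
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            // Sleep until it is our turn or the work is finished.
            cv.wait(lock, [&] { return turn == id || count >= limit; });
            if (count >= limit) {
                return;
            }
            ++count;
            order.push_back(id);
            turn = (id == 1) ? 2 : 1;   // hand the turn to the other thread
            lock.unlock();
            cv.notify_all();
        }
    };

    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();
    return order;
}
```

Calling alternate_increments(10) should give the ids in strict 1, 2, 1, 2 order. The wait predicate also checks count >= limit, so neither thread is left sleeping once the other one has finished.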

Given length and sum of digits, we have to find the minimum and maximum number that can be made?

As the question states, we are given a positive integer M and a non-negative integer S. We have to find the smallest and the largest numbers that have length M and digit sum S.
Constraints:
(S>=0 and S<=900)
(M>=1 and M<=100)
I thought about it and came to the conclusion that it must be dynamic programming. However, I failed to build the DP state.
This is what I thought:
dp[i][j] = first 'i' digits having sum 'j'
I tried to write a program for it. This is what it looks like:
/*
*** PATIENCE ABOVE PERFECTION ***
"When in doubt, use brute force. :D"
-Founder of alloj.wordpress.com
*/
#include<bits/stdc++.h>
using namespace std;
#define pb push_back
#define mp make_pair
#define nline cout<<"\n"
#define fast ios_base::sync_with_stdio(false),cin.tie(0)
#define ull unsigned long long int
#define ll long long int
#define pii pair<int,int>
#define MAXX 100009
#define fr(a,b,i) for(int i=a;i<b;i++)
vector<int>G[MAXX];
int main()
{
int m,s;
cin>>m>>s;
int dp[m+1][s+1];
fr(1,m+1,i)
fr(1,s+1,j)
fr(0,10,k)
dp[i][j]=min(dp[i-1][j-k]+k,dp[i][j]); //Tried for Minimum
cout<<dp[m][s]<<endl;
return 0;
}
Please guide me about this DP state and what the time complexity of the program will be. This is my first try at DP.
A DP solution:
#include<iostream>
using namespace std;
int dp[102][902][2] ;
void print_ans(int m , int s , int flag){
if(m==0)
return ;
cout<<dp[m][s][flag];
if(dp[m][s][flag]!=-1)
print_ans(m-1 , s-dp[m][s][flag] , flag );
return ;
}
int main(){
//freopen("problem.in","r",stdin);
//freopen("out.txt","w",stdout);
//int t;
//cin>>t;
//while(t--){
int m , s ;
cin>>m>>s;
if(s==0){
cout<<(m==1?"0 0":"-1 -1");
return 0;
}
for(int i = 0 ; i <=m ; i++){
for(int j=0 ; j<=s ;j++){
dp[i][j][0]=-1;
dp[i][j][1]=-1;
}
}
for(int i = 0 ; i < 10 ; i++){
dp[1][i][0]=i;
dp[1][i][1]=i;
}
for(int i = 2 ; i<=m ; i++){
for(int j = 0 ; j<=s ; j++){
int flag = -1;
int f = -1;
for(int k = 0 ; k <= 9 ; k++){
if(i==m&&k==0)
continue;
if( j>=k && flag==-1 && dp[i-1][j-k][0]!=-1)
flag = k;
}
for(int k = 9 ; k >=0 ;k--){
if(i==m&&k==0)
continue;
if( j>=k && f==-1 && dp[i-1][j-k][1]!=-1)
f = k;
}
dp[i][j][0]=flag;
dp[i][j][1]=f;
}
}
if(m!=0){
print_ans(m , s , 0);
cout<<" ";
print_ans(m,s,1);
}
else
cout<<"-1 -1";
cout<<endl;
// }
}
The DP state is (i, j). Think of it as the parameters of a mathematical function defined by a recurrence, that is, in terms of smaller problems (the subproblems).
In more depth:
The state is the set of parameters that identifies a subproblem uniquely, so that we always know what we are computing.
Take your question as the example.
To define your problem we need the number of digits used so far plus the sum formed with those digits. (Note: you are effectively carrying the sum along while walking through the digits.)
I think that is enough for the state part.
Now,
the running time of dynamic programming is simple to work out.
First let us see how many subproblems exist in a problem:
You need to fill in every state, i.e. you have to cover all the distinct subproblems smaller than or equal to the whole problem.
Which problem is smaller than another is determined by the recurrence relation.
For example:
Fibonacci sequence
F(n)=F(n-1)+F(n-2)
Note that the base case is always the smallest subproblem.
Here, for F(n) we have to calculate F(n-1) and F(n-2), and this continues until n=1, where you return the base case.
Hence the total number of subproblems is the number of problems between the base case and the current problem.
Now,
in bottom-up DP we process every state, in order of size, between the base case and the full problem.
This tells us that the running time is
O(Number of subproblems * Time per subproblem).
So how many subproblems exist in your solution? DP[0][0] to DP[M][S], i.e. (M+1)*(S+1) states,
and for every subproblem you run a loop over the 10 digits:
O(M*S (subproblems) * 10)
Drop the constant factor and you get O(M*S).
But the per-subproblem cost is not always a constant!
Here is some code you might want to look at. Feel free to ask anything!
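#include<bits/stdc++.h>
using namespace std;
// i runs from 0 to 9 below, so the first dimension must be 10 (9 would overflow).
bool DP[10][101];
int Number[10][101];
int main()
{
    DP[0][0]=true; // It is possible to form sum 0 using no digits at all.
    int i,j,k;
    for(i=1;i<=9;++i)
        for(j=0;j<=100;++j)
        {
            if(DP[i-1][j])
            {
                for(k=0;k<=9;++k)
                    if(j+k<=100)
                    {
                        DP[i][j+k]=true;
                        // Later writes overwrite earlier ones, so the last combination found is kept.
                        Number[i][j+k]=Number[i-1][j]*10+k;
                    }
            }
        }
    cout<<Number[9][81]<<"\n"; // a 9-digit number with digit sum 81
    return 0;
}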
Because your constraints are large (up to 100 digits, which does not fit in any built-in integer type), you should backtrack through the DP table to recover the digits rather than store the numbers directly.
DP[i][j] is true if the digit sum j can be formed using exactly i digits.
Number[i][j]
is my laziness to avoid typing out the backtracking (sleepy, it's already 3 A.M.).
I am trying every possible digit to extend each reachable state.
It is essentially the forward DP style. You can read more about it at Topcoder.
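As a cross-check for the DP answers, this particular problem can also be solved greedily in O(M) time. The sketch below is a different technique from the DP discussed above, not a version of it, and the function name min_max_with_digit_sum is made up for illustration. It returns the numbers as strings, since 100 digits do not fit in an integer, and "-1" for both when no such number exists.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper (not from the original posts): returns {smallest, largest}
// as decimal strings, or {"-1", "-1"} when no m-digit number has digit sum s.
std::pair<std::string, std::string> min_max_with_digit_sum(int m, int s)
{
    assert(m >= 1 && s >= 0);

    if (s == 0) {
        // Only the one-digit number 0 has digit sum 0.
        if (m == 1) return {"0", "0"};
        return {"-1", "-1"};
    }
    if (s > 9 * m) {
        return {"-1", "-1"};
    }

    // Largest: hand out as much as possible starting from the left.
    std::string largest(m, '0');
    int rest = s;
    for (int i = 0; i < m; ++i) {
        int d = rest < 9 ? rest : 9;
        largest[i] = static_cast<char>('0' + d);
        rest -= d;
    }

    // Smallest: reserve 1 for the leading digit, fill 9s from the right,
    // then give whatever is left back to the leading digit.
    std::string smallest(m, '0');
    rest = s - 1;
    for (int i = m - 1; i > 0; --i) {
        int d = rest < 9 ? rest : 9;
        smallest[i] = static_cast<char>('0' + d);
        rest -= d;
    }
    smallest[0] = static_cast<char>('0' + 1 + rest);

    return {smallest, largest};
}
```

For M=2 and S=15 this gives "69" and "96". Comparing its output with the DP programs on small inputs is an easy way to test them.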

ArgumentException while reading using readblock streamreader

I am trying to calculate row count from a large file based on presence of a certain character and would like to use StreamReader and ReadBlock - below is my code.
protected virtual long CalculateRowCount(FileStream inStream, int bufferSize)
{
long rowCount=0;
String line;
inStream.Position = 0;
TextReader reader = new StreamReader(inStream);
char[] block = new char[4096];
const int blockSize = 4096;
int indexer = 0;
int charsRead = 0;
long numberOfLines = 0;
int count = 1;
do
{
charsRead = reader.ReadBlock(block, indexer, block.Length * count);
indexer += blockSize ;
numberOfLines = numberOfLines + string.Join("", block).Split(new string[] { "&ENDE" }, StringSplitOptions.None).Length;
count ++;
} while (charsRead == block.Length);//charsRead !=0
reader.Close();
fileRowCount = rowCount;
return rowCount;
}
But I get the error:
Offset and length were out of bounds for the array or count is greater than the number of elements from index to the end of the source collection.
I am not sure what is wrong. Can you help? Thanks in advance!
For one, read the StreamReader.ReadBlock() documentation carefully (http://msdn.microsoft.com/en-us/library/system.io.streamreader.readblock.aspx) and compare it with what you're doing:
The 2nd argument (index) is the position in the buffer at which to start writing, so it must stay within the block you pass in. You add blockSize to it on every iteration, so it is out of range from the second call on. Since it looks like you want to reuse the memory block, pass 0 here.
The 3rd argument (count) is the maximum number of characters to read into the block. It must not exceed the space left in the buffer after index; you pass block.Length * count, which is larger than the block from the second iteration on, and that is what raises the ArgumentException.
ReadBlock() returns the number of characters actually read, but you increment indexer as if it always returned exactly the block size (at the end of the file it usually won't).
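The three points above are not specific to .NET. As a rough illustration, here is the same chunked-reading pattern written with C++ standard streams rather than StreamReader: the buffer is always filled from position 0, the request never exceeds the buffer size, and only the characters actually read are used. The function name count_marker is made up for this sketch, and it counts occurrences of the marker (the question's code counts split parts instead). It also keeps a short tail between chunks so a marker split across two blocks is not missed.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the original posts): counts non-overlapping
// occurrences of `marker` in `in`, reading through one reused buffer.
long count_marker(std::istream& in, const std::string& marker, std::size_t block_size)
{
    assert(!marker.empty() && block_size > 0);

    std::vector<char> block(block_size);
    std::string window;   // unprocessed tail of earlier chunks + current chunk
    long hits = 0;

    for (;;) {
        // Always fill the buffer from position 0 and never ask for more
        // than it can hold.
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;

        // Use only the characters that were really read.
        window.append(block.data(), got);

        std::size_t pos = 0;
        std::size_t found;
        while ((found = window.find(marker, pos)) != std::string::npos) {
            ++hits;
            pos = found + marker.size();
        }

        // Keep a tail shorter than the marker so a marker split across two
        // chunks is still seen, without counting anything twice.
        std::size_t keep = marker.size() - 1;
        std::size_t tail_start = window.size() > keep ? window.size() - keep : 0;
        if (tail_start < pos) tail_start = pos;
        window.erase(0, tail_start);
    }
    return hits;
}
```

A small block size such as 4 is handy for testing, because it forces the marker to straddle block boundaries.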

SSE instruction is giving error

I am using the following code to divide all int array elements with constant factor using SSE.
void sse_div(int *arr,int num_shift,int N) // divide all array elements by 2
{
num_shift=1;
int nb_iters = N / 4;
__declspec(align(32))int *a1=arr;
__m128i* l = (__m128i*)a1;
for (int i = 0; i < nb_iters; ++i, ++l)
_mm_store_si128( l, _mm_srai_epi32(*l,num_shift)); //Error line
}
But I am getting the following error
I am unable to get rid of this problem.
Can anybody please help to solve this problem.
Any help will be appreciated.
Thanks in Advance
Since your input array is apparently misaligned you can use unaligned loads/stores, e.g.:
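#include <emmintrin.h> // SSE2 intrinsics

void sse_div(int *arr, int N) // divide all array elements by 2
{
    // Processes 4 ints per iteration; if N is not a multiple of 4,
    // the last N % 4 elements are left untouched.
    for (int i = 0; i + 4 <= N; i += 4)
    {
        __m128i v = _mm_loadu_si128((const __m128i *)&arr[i]); // unaligned load
        v = _mm_srai_epi32(v, 1);                              // arithmetic shift right by 1
        _mm_storeu_si128((__m128i *)&arr[i], v);               // unaligned store
    }
}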
Note that there may be a significant performance hit from using unaligned loads/stores (depending on what CPU you are running on), so if possible you should make your arr array 16 byte aligned when you allocate the memory.

MPI-IO deadlock using MPI_File_write_all

My MPI code deadlocks when I run this simple code on 512 processes on a cluster. I am far from the memory limit. If I increase the number of processes to 2048, which is far too many for this problem, the code runs again. The deadlock occurs in the line containing MPI_File_write_all.
Any suggestions?
int count = imax*jmax*kmax;
// CREATE THE SUBARRAY
MPI_Datatype subarray;
int totsize [3] = {kmax, jtot, itot};
int subsize [3] = {kmax, jmax, imax};
int substart[3] = {0, mpicoordy*jmax, mpicoordx*imax};
MPI_Type_create_subarray(3, totsize, subsize, substart, MPI_ORDER_C, MPI_DOUBLE, &subarray);
MPI_Type_commit(&subarray);
// SET THE VALUE OF THE GRID EQUAL TO THE PROCESS ID FOR CHECKING
if(mpiid == 0) std::printf("Setting the value of the array\n");
for(int i=0; i<count; i++)
u[i] = (double)mpiid;
// WRITE THE FULL GRID USING MPI-IO
if(mpiid == 0) std::printf("Write the full array to disk\n");
char filename[] = "u.dump";
MPI_File fh;
if(MPI_File_open(commxy, filename, MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_EXCL, MPI_INFO_NULL, &fh))
return 1;
// select noncontiguous part of 3d array to store the selected data
MPI_Offset fileoff = 0; // the offset within the file (header size)
char name[] = "native";
if(MPI_File_set_view(fh, fileoff, MPI_DOUBLE, subarray, name, MPI_INFO_NULL))
return 1;
if(MPI_File_write_all(fh, u, count, MPI_DOUBLE, MPI_STATUS_IGNORE))
return 1;
if(MPI_File_close(&fh))
return 1;
Your code looks right upon quick inspection. I would suggest that you let your MPI-IO library help tell you what's wrong: instead of returning from error, why don't you at least display the error? Here's some code that might help:
static void handle_error(int errcode, const char *str)
{
char msg[MPI_MAX_ERROR_STRING];
int resultlen;
MPI_Error_string(errcode, msg, &resultlen);
fprintf(stderr, "%s: %s\n", str, msg);
MPI_Abort(MPI_COMM_WORLD, 1);
}
Is MPI_SUCCESS guaranteed to be 0? I'd rather see
errcode = MPI_File_routine();
if (errcode != MPI_SUCCESS) handle_error(errcode, "MPI_File_open(1)");
Put that in and if you are doing something tricky like setting a file view with offsets that are not monotonically non-decreasing, the error string might suggest what's wrong.
