Suppose I have a 2-dimensional vector
vector<vector<int>> v;
Then thread 1 operates on v[0], like
v[0].push_back(x)
thread 2 operates on v[1],
v[1].remove(y)
etc.
Is this operation thread safe? I suppose it is, since v[0] and v[1] are separate objects that live at different memory addresses. Could these two objects ever end up overlapping?
Use concurrent_vector provided by Intel.
https://software.intel.com/en-us/node/506203
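As a minimal sketch (my own illustration, assuming oneTBB / Intel TBB is installed and typically linked with -ltbb): tbb::concurrent_vector allows several threads to grow the same container via push_back without external locking; concurrent growth is safe, but concurrent erasure is not. Here the outer std::vector is sized once up front and never resized while the threads run.
#include <tbb/concurrent_vector.h>
#include <thread>
#include <vector>

int main() {
    // The outer vector is created once and never resized afterwards;
    // each element is a concurrent_vector that supports concurrent growth.
    std::vector<tbb::concurrent_vector<int>> v(2);

    std::thread t1([&] { for (int x = 0; x < 1000; ++x) v[0].push_back(x); });
    std::thread t2([&] { for (int y = 0; y < 1000; ++y) v[1].push_back(y); });

    t1.join();
    t2.join();
    return 0;
}
Even if both threads pushed into the same element, the push_back calls themselves would remain safe with concurrent_vector; with plain std::vector that is only true as long as each thread touches its own element and the outer vector is not modified.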
I have an interval-treeish algorithm I would like to run in parallel for many queries using threads. Problem is that then each thread would need its own array, since I cannot know in advance how many hits there will be.
There are other questions like this, and the solution suggested is always to have an array of size (K, t) where K is output length and t is number of threads. This does not work for me as K might be different for each thread and each thread might need to resize the array to fit all the results it gets.
Pseudocode:
for i in prange(len(starts)):
    qs, qe, qx = starts[i], ends[i], index[i]
    results = t.search(qs, qe)
    if len(results) + nfound < len(output):
        # add results to output
    else:
        # resize array
        # then add results
A common pattern is for every thread to get its own container, which is a trade-off between speed/complexity and memory overhead:
there is no need to lock for access to this container, because only one thread accesses it.
there is much less overhead compared to "own container for every task (i.e. every i-value)".
After the parallel section, the data must either be collected in a final container in a post-processing step (which could also happen in parallel), or the subsequent algorithms must be able to handle a collection of containers.
Here is an example using a C++ vector (which already has memory management and automatic resizing built in):
%%cython -+ -c=/openmp --link-args=/openmp
from cython.parallel import prange, threadid
from libcpp.vector cimport vector
cimport openmp
def calc_in_parallel(N):
    cdef int i, k, tid
    cdef int n = N
    cdef vector[vector[int]] vecs
    # every thread gets its own container
    vecs.resize(openmp.omp_get_max_threads())
    for i in prange(n, nogil=True):
        tid = threadid()
        for k in range(i):
            # use container of the thread
            vecs[tid].push_back(k)   # dummy for calculation
    return vecs
Using omp_get_max_threads() for the number of containers will overestimate the number of threads actually used in many cases. It is probably more robust to set the number of threads explicitly in prange, i.e.
...
NUM_THREADS = 2
vecs.resize(NUM_THREADS)
for i in prange(n, nogil=True, num_threads=NUM_THREADS):
    ...
A similar approach can be applied using pure C, but much more boilerplate code (manual memory management) would be needed in that case.
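For reference, here is a minimal C++/OpenMP sketch of the same per-thread-container pattern (my own illustration, not taken from the Cython answer above; compile with -fopenmp): each thread appends only to its own std::vector, and the per-thread results are merged in a post-processing step.
#include <omp.h>
#include <vector>

std::vector<int> calc_in_parallel(int n) {
    // one container per thread, sized once before the parallel loop
    std::vector<std::vector<int>> per_thread(omp_get_max_threads());

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        int tid = omp_get_thread_num();
        for (int k = 0; k < i; ++k)
            per_thread[tid].push_back(k);   // dummy work, mirrors the Cython example
    }

    // post-processing: merge the per-thread containers into one result
    std::vector<int> merged;
    for (const auto& v : per_thread)
        merged.insert(merged.end(), v.begin(), v.end());
    return merged;
}
Because each thread writes only to its own inner vector, no locking is needed inside the loop; the only serial part is the final merge.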
I am trying to parallelize the following snippet of code using OpenMP. The portion shown is basically a version of the Thomas algorithm (a tridiagonal matrix solver) used in an implicit solver for fluid flow/turbulence simulations. I have stripped it down from its exact form to make the question easier to understand.
REAL(KIND=DP), INTENT(INOUT), DIMENSION(1:10) :: A, C_old, C_new, D
INTEGER :: K
!$OMP PARALLEL DO SCHEDULE(STATIC) NUM_THREADS(3)&
!$OMP SHARED(A, C_old, C_new, D)
DO K = 2,9
C_new(K) = C_old(K)/(A(K)*C_new(K-1))
END DO
!$OMP END PARALLEL DO
C(1) = C(10) = 0 because they lie on the fixed boundaries (the top and bottom walls of a channel), so there is no need to update them.
As is evident, the calculation of C(2) needs the information of C(1), the calculation of C(3) needs the information of the MODIFIED C(2), and so on. In other words, it is a forward sweep (the algorithm is described at https://en.wikipedia.org/wiki/Tridiagonal_matrix_algorithm). Hence, for three threads, I want the distribution of iterations across the threads to be [1,2,3,4], [4,5,6,7], [7,8,9,10]. But with a plain SCHEDULE(STATIC) I cannot control this, and the distribution might end up as [1,2,3], [4,5,6], [7,8,9,10] across the 3 threads. This would give erroneous results, because the information of grid points 4 and 7 is not passed on to the other threads.
Is there a way to force particular indices to be solved by a particular thread, without any race? Say,
thread #1 solve indices [1,2,3,4]
thread #2 solve indices [4,5,6,7] etc.
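For the narrow scheduling part of the question ("can I force certain indices onto a certain thread?"), here is a minimal C++/OpenMP sketch, my own illustration rather than something from this thread (compile with -fopenmp): each thread derives an explicit contiguous index range from omp_get_thread_num() instead of relying on SCHEDULE(STATIC). Note that this only pins index ranges to threads; it does not resolve the forward-sweep dependence of C_new(K) on C_new(K-1), which would still require processing the chunks in order or switching to a different algorithm (e.g., cyclic reduction).
#include <omp.h>
#include <cstdio>

int main() {
    // Interior grid points are K = 2..9, as in the Fortran snippet above.
    #pragma omp parallel num_threads(3)
    {
        int tid    = omp_get_thread_num();
        int nthr   = omp_get_num_threads();
        int chunk  = (8 + nthr - 1) / nthr;   // 8 interior points split into chunks
        int kstart = 2 + tid * chunk;
        int kend   = kstart + chunk - 1;
        if (kend > 9) kend = 9;

        // Each thread now owns the contiguous range [kstart, kend].
        // NOTE: this controls only which thread gets which indices; the
        // forward-sweep dependence still prevents the chunks from running
        // concurrently as written.
        std::printf("thread %d handles K = %d..%d\n", tid, kstart, kend);
    }
    return 0;
}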
Is there a difference between a thread conflict and a data race?
As per what I've learnt, conflicting operations occur when two threads try to access the same memory location and at least one of them is a write operation.
Here is what Wikipedia has to say about data races / race conditions.
How are they different?
I have finally found a good answer to this question.
TL;DR:
Conflicting operations:
Involve multiple threads
Access the same memory location
At least one of them is a write operation.
Data race: unordered conflicting operations.
LONG VERSION:
Below I explain with an example how conflicting operations occur and how to identify whether they are data-race free.
Consider Thread 1 and Thread 2, and the shared variables done and x.
AtomicBoolean done = new AtomicBoolean(false);
int x = 0;
Thread 1
x = f();
done.set(true);
Thread 2
while(!done.get())
{
/* a block */
}
y = g(x);
Here done.set() / done.get() and x = f() / y = g(x) are the conflicting pairs. However, the language's memory model defines two relations: synchronizes-with and happens-before. Because done is atomic, the write done.set(true) synchronizes-with the done.get() that finally reads true, and that edge establishes a happens-before ordering between the pair.
Now, x = f() happens-before done.set(true) in Thread 1 (program order), done.set(true) happens-before the done.get() that observes it in Thread 2 (via synchronizes-with), and that done.get() happens-before y = g(x) (program order). Since happens-before is transitive, x = f() happens-before y = g(x).
Thus the above example is ordered and consequently data-race free.
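The same ordering argument can be sketched in C++ with std::atomic<bool> (my own analogue of the Java example above, not part of the original answer; compile with -pthread): the release store to done synchronizes-with the acquire load that finally reads true, so x = f() happens-before y = g(x). Replacing the atomic with a plain bool would leave the two conflicting pairs unordered, i.e. a data race.
#include <atomic>
#include <thread>

std::atomic<bool> done{false};
int x = 0;

int f() { return 42; }
int g(int v) { return v + 1; }

int main() {
    std::thread t1([] {
        x = f();                                      // A
        done.store(true, std::memory_order_release);  // synchronizes-with the load below
    });
    std::thread t2([] {
        while (!done.load(std::memory_order_acquire)) { /* spin */ }
        int y = g(x);                                 // B: A happens-before B, no data race
        (void)y;
    });
    t1.join();
    t2.join();
    return 0;
}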
I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). N is large, as is M; on the order of several thousands. I am running Matlab 2013a.
Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically.
This appears NOT to be the case if either of the matrices is sparse (see e.g. this StackOverflow discussion - with no answer to the intended question - and this largely unanswered MathWorks thread).
This is a rather unhappy surprise for me.
We can verify that multithreading has no effect on sparse matrix operations with the following code:
clc; clear all;
N = 5000; % set matrix sizes
M = 3000;
A = randn(N,M); % create dense random matrices
B = sprand(N,N,0.015); % create sparse random matrix
Bf = full(B); %create a dense form of the otherwise sparse matrix B
for i=1:3                        % test for 1, 2, and 4 threads
    m(i) = 2^(i-1);
    maxNumCompThreads(m(i));     % set the thread count available to Matlab
    tic                          % starts timer
    y = B*A;
    walltime(i) = toc;           % wall clock time
    speedup(i) = walltime(1)/walltime(i);
end
% display number of threads vs. speed up relative to just a single thread
[m',speedup']
This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:
threads speedup
1.0000 1.0000
2.0000 0.9950
4.0000 1.0155
If, on the other hand, I replace B by its dense form, referred to as Bf above, I get significant speedup:
threads speedup
1.0000 1.0000
2.0000 1.8894
4.0000 3.4841
(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)
So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form?
I found one old suggestion involving .mex files at MathWorks, but the links seem to be dead and there is little documentation or feedback. Any alternatives?
It seems a rather severe restriction of the implicit parallelism functionality, since sparse matrices abound in computationally heavy problems, and multithreaded functionality is highly desirable in these cases.
MATLAB already uses SuiteSparse by Tim Davis for many of its operations on sparse matrices (for example, see here), but neither of those is multithreaded, as far as I know.
Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even if you use a multithreaded library, I doubt you will see huge benefits in terms of performance, at least not comparable to those of libraries specialized in dense matrices...
After all, the design of sparse matrices has different goals in mind than that of regular dense matrices: efficient memory storage is often more important.
I did a quick search online, and found a few implementations out there:
sparse BLAS, spBLAS, PSBLAS. For instance, Intel MKL and AMD ACML do have some support for sparse matrices
cuSPARSE, CUSP, VexCL, ViennaCL, etc., which run on the GPU.
I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.
#include <omp.h>
#include "mex.h"
#include "matrix.h"
#define ll long long
void omp_smm(double* A, double* B, double* C, ll m, ll p, ll n, ll* irs, ll* jcs)
{
    for (ll j = 0; j < p; ++j)
    {
        ll istart = jcs[j];
        ll iend = jcs[j+1];
        #pragma omp parallel for
        for (ll ii = istart; ii < iend; ++ii)
        {
            ll i = irs[ii];
            double aa = A[ii];
            for (ll k = 0; k < n; ++k)
            {
                C[i+k*m] += B[j+k*p]*aa;
            }
        }
    }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *A, *B, *C;     /* pointers to input & output matrices */
    size_t m, n, p;        /* matrix dimensions */
    A = mxGetPr(prhs[0]);  /* first sparse matrix */
    B = mxGetPr(prhs[1]);  /* second full matrix */
    mwIndex *irs = mxGetIr(prhs[0]);
    mwIndex *jcs = mxGetJc(prhs[0]);
    m = mxGetM(prhs[0]);
    p = mxGetN(prhs[0]);
    n = mxGetN(prhs[1]);

    /* create output matrix C */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    C = mxGetPr(plhs[0]);

    omp_smm(A, B, C, m, p, n, (ll*)irs, (ll*)jcs);
}
On MATLAB Central the same question was asked, and this answer was given:
I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc...
Which basically means that you are out of luck.
However I can think of some tricks to achieve faster computations:
If you need to do several multiplications: do multiple multiplications at once and process them in parallel.
If you just want to do one multiplication: Cut the matrix into pieces (for example top half and bottom half), do the calculations of the parts in parallel and combine the results afterwards
Probably these solutions will not turn out to be as fast as properly implemented multithreading, but hopefully you can still get a speedup.
How can I determine if the following memory access is coalesced or not:
// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;
while ( idx < NUMELEMENTS )
{
    // Do Something
    // ....

    // Write to Array which contains results of calculations
    results[ idx ] = df2;

    // Next Element
    idx += offset;
}
NUMELEMENTS is the total number of data elements to process. The array results is passed as a pointer to the kernel function and was allocated beforehand in global memory.
My Question: Is the write access in the line results[ idx ] = df2; coalesced?
I believe it is, as each thread processes consecutively indexed items, but I'm not completely sure about it and I don't know how to tell.
Thanks!
It depends on whether the length of the rows of your matrix is a multiple of half the warp size for devices of compute capability 1.x, or a multiple of the full warp size for devices of compute capability 2.x. If it is not, you can use padding to make it fully coalesced. The function cudaMallocPitch can be used for this purpose.
edit:
Sorry for the confusion. You write 'offset' elements at a time, which I interpreted as rows of a matrix.
What I mean is: after each iteration of your loop you increase idx by offset. If offset is a multiple of half the warp size for devices of compute capability 1.x, or a multiple of the warp size for devices of compute capability 2.x, then the access is coalesced; if not, you need padding to make it so.
It is probably already coalesced, because you should choose the number of threads per block, and thus blockDim, as a multiple of the warp size.