I am multiplying a row of a matrix by the inverse of the principal diagonal element of that row. I have implemented it with 1-D parallel code. Every thread runs this code:
1. Read the principal diagonal element.
2. Calculate the inverse of that element.
3. Multiply the inverse with the element indexed at the thread id.
The problem arises when the i-th thread in the i-th row executes step 3 before the other threads execute step 1: it changes the value of the principal diagonal element before the others can read it.
Does OpenCL have a barrier-like construct that only allows a thread to execute step 3 after all threads have executed step 1?
I don't want to use busy-wait loops, because there are worst cases in which they fail.
One way is to add barrier(CLK_LOCAL_MEM_FENCE) (or CLK_GLOBAL_MEM_FENCE if the row lives in global memory). Note that barrier() only synchronizes work-items within the same work-group, so this works when one work-group processes the whole row.
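For illustration, here is a minimal pyopencl sketch of that approach (hypothetical host code and kernel name; it assumes one work-group covers the whole row, so that barrier() actually synchronizes every work-item that touches it):

import numpy as np
import pyopencl as cl

kernel_src = """
__kernel void scale_row(__global float *mat, const int n, const int row)
{
    int col = get_global_id(0);
    // step 1: every work-item reads the principal diagonal element
    float pivot = mat[row * n + row];
    // no work-item may write until the whole work-group has read the pivot
    barrier(CLK_GLOBAL_MEM_FENCE);
    // steps 2 and 3: multiply by the inverse of the pivot
    mat[row * n + col] *= (1.0f / pivot);
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prog = cl.Program(ctx, kernel_src).build()

n = 4
mat = np.eye(n, dtype=np.float32) * 2.0
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=mat)
# global size == local size == n: one work-group per row, so the barrier
# covers all work-items involved in the row
prog.scale_row(queue, (n,), (n,), buf, np.int32(n), np.int32(0))
cl.enqueue_copy(queue, mat, buf)
print(mat)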
The other way is to separate the work into two kernels; you can pass the cl_mem computed by step 1's kernel directly to step 3's kernel, so this won't cause any CPU/GPU IO.
Multiplying a diagonal matrix by a dense matrix is a set of dot products, which can be done using a reduction. That will make your function faster.
I am studying the NVIDIA torch matmul function.
### variable creation
a = torch.randn(size=(1,128,3), dtype=torch.float32).to('cuda')
b = torch.randn(size=(1,3,32), dtype=torch.float32).to('cuda')
### execution
c = torch.matmul(a,b)
I profiled this code using pyprof, which gives the result below.
There are many things there that I cannot understand.
What does sgemm_128_32 mean?
I see that the 's' in sgemm stands for single precision and 'gemm' means general matrix multiplication, but I don't know what the 128_32 means. My output matrix dimension is 128 by 32, but I know that CUTLASS optimizes sgemm using outer products (I will give you the link, ref 1). Actually, I cannot understand the link.
(1) Does 128_32 simply mean the output matrix's dimensions?
(2) Is there any way to see how my output matrix (c, in my code) is actually calculated?
(For example: there are 128*32 threads in total, and each thread is responsible for one output element, computed as an inner product.)
Why do the Grid and Block have 3 dimensions each, and how are the grid and block used for sgemm_128_32?
A grid consists of x, y, z, and a block consists of x, y, z. (1) Why do you need 3 dimensions? I see (in the picture above) that block X has 256 threads. (2) Is this true? And Grid Y is 4, so that means there are 4 blocks along Grid Y. (3) Is this true?
Using that pyprof result, can I figure out how many SMs are used, and how many warps are active on each SM?
Thank you.
ref 1 : https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
I have a 500 x 500 2D array of floats. I wish to search from the middle of the array, in the vertical and horizontal directions, for the first zero element in each direction. The output should be 4 indices: one for the first zero element in each of the North, South, East and West directions. Is there a way to parallelize this search operation with CUDA?
Thanks.
(This answer assumes that you are not searching entire quadrants, but only the straight lines in each direction)
1. In case the array is in CPU memory
In fact, you have a search space of just 1,000 elements. The overhead of copying the data, launching the kernel and waiting for the result is such that it is not worth your trouble.
Do it on the CPU. One of your axes already has the data laid out nicely and consecutively; it is probably best to work on that axis first. The other axis will be a bitch in terms of memory access, but that's life. You could go multi-threaded here, but I'm not sure it's worth your trouble for so little work; if you did, each thread could take its own segment of an axis.
As for the algorithm - since your data isn't sorted, it's basically a linear search (up to vectorization). If you've gone multi-threaded, perhaps use a shared variable which a thread occasionally polls to see whether a closer-to-the-center thread has found a zero yet; when a thread finds a zero, it updates that variable to let the other threads know they can stop working.
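Here is a rough Python sketch of that polling scheme for one direction (east of the center); the chunking and names are mine, not from the question, and CPython's GIL means this only demonstrates the idea rather than a real speedup:

import threading

def search_east(row, center, workers=4):
    n = len(row)
    found = [n]                      # shared: closest zero index seen so far
    lock = threading.Lock()

    def scan(lo, hi):
        for i in range(lo, min(hi, n)):
            if i >= found[0]:        # a closer-to-the-center thread already won
                return
            if row[i] == 0:
                with lock:
                    found[0] = min(found[0], i)
                return

    chunk = (n - center + workers - 1) // workers
    threads = [threading.Thread(target=scan,
                                args=(center + s * chunk,
                                      center + (s + 1) * chunk))
               for s in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return found[0] if found[0] < n else None

print(search_east([5.0, 1.0, 0.0, 3.0, 0.0, 2.0], center=1))  # -> 2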
2. In case the array is in GPU global memory
Now you get lots of (CUDA) threads, so it makes less sense to use an atomic variable, polling, etc.
We treat each of the four directions separately (although it doesn't have to be 4 separate kernels).
As @RobertCrovella notes, you can treat this problem as a parallel reduction, with each thread assigned one input element: initially, each thread holds infinity if its corresponding element is non-zero, or its distance from the center if its corresponding array value is 0. The reduction operator is then "minimum".
This is not entirely optimal, because when warp or block results are collected (as part of a parallel reduction), this problem allows short-circuiting as soon as the lowest non-infinity value is located. You can read up on how parallel reduction is implemented - but I really wouldn't bother, because you have a very small amount of computational work here.
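To make the formulation concrete, here is a NumPy sketch of the same min-reduction for one half-axis (plain host code standing in for the CUDA kernel; the function name is mine):

import numpy as np

def first_zero_east(row, center):
    half = row[center:]
    # each "thread's" value: distance from the center if the element is
    # zero, +inf otherwise; the reduction operator is minimum
    dist = np.where(half == 0, np.arange(half.size, dtype=float), np.inf)
    d = dist.min()
    return center + int(d) if np.isfinite(d) else None

print(first_zero_east(np.array([5.0, 1.0, 0.0, 3.0]), center=1))  # -> 2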
Note: it is also possible that your array is in GPU array memory; in that case you would get better locality in both dimensions.
It's not really clear how you define "first zero element in the North, South, East and West directions", but I could imagine a rectangular data set broken into 4 quadrants along the diagonals.
We could label the top region the "north region", and label the other regions similarly.
With that assumption, in the worst case you have to check every element of the array.
Therefore one possible approach is a parallel reduction.
You would then do a parallel reduction in each region, minimizing the distance from the center (using the standard distance formula) over the zero elements of that region.
If you are actually only interested in the elements on the vertical and horizontal axes that pass through the center of the image, then another approach may be better.
Even in that case, I think a parallel reduction would be a typical approach: two per axis, one for each half-axis, considering only the zero elements on that half.
def calculateShortestPath(self, vertexList, edgeList, startVertex):
    startVertex.minDistance = 0
    for i in range(0, len(vertexList) - 1):  # N-1 iterations
        for edge in edgeList:
            # relaxation step
            u = edge.startVertex
            v = edge.targetVertex
            newDistance = u.minDistance + edge.weight
            if newDistance < v.minDistance:
                v.minDistance = newDistance
                v.predecessor = u
    for edge in edgeList:  # final pass to detect negative cycles
        if self.hasCycle(edge):
            print("NEGATIVE CYCLE DETECTED")
            self.HAS_CYCLE = True
            return
The above function is part of an implementation of the Bellman-Ford algorithm. My question is: how can one be sure that after N-1 iterations the minimum distances have been calculated? In the case of Dijkstra it was understood that once the priority queue is empty all the shortest paths have been found, but I can't understand the reasoning behind the N-1 here.
N - the length of the vertex list.
Vertex list - contains the different vertices.
Edge list - the list of the different edges.
The implementation may be wrong, since I got it from a tutorial video. Thanks for the help.
The outer loop executes N-1 times because a shortest path cannot contain more edges than that; otherwise it would contain a cycle, which can be avoided.
Minor: if a path on N vertices has N edges, then at least one vertex is visited twice, so such a path contains a cycle.
Unlike Dijkstra's, the algorithm is not greedy but dynamic. In the first iteration of the loop it builds one possible path between two vertices, and then at each iteration it improves the path by at least one edge. As the shortest path can use at most n-1 edges, the loop iterates that many times to find the shortest path.
For negative cycles, the algorithm checks one more time, at the nth iteration, whether some edge still decreases the weight of a shortest path that already has n-1 edges. If so, the graph must contain a negative cycle, since in its absence a shortest path consists of at most n-1 edges.
Take any graph without a negative-sum cycle: if you relax the edges in the right order (selecting each edge by the position of its source vertex, e.g. in topological order for a DAG), you can arrive at the answer in only one iteration, relaxing every edge once.
The n-1 term comes from the fact that we do not take the edges in such a favorable order but process them in an arbitrary one.
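A small worked example of that worst case (a hypothetical edge list, not the question's classes): the edges of the path 0 -> 1 -> 2 -> 3 are listed in reverse order, so each pass over the edge list propagates the distance by only one edge, and all N-1 = 3 passes are needed:

import math

edges = [(2, 3, 1.0), (1, 2, 1.0), (0, 1, 1.0)]  # (u, v, weight)
n = 4
dist = [math.inf] * n
dist[0] = 0.0

for i in range(n - 1):
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            dist[v] = dist[u] + w
    print(f"after pass {i + 1}: {dist}")
# after pass 1: [0.0, 1.0, inf, inf]
# after pass 2: [0.0, 1.0, 2.0, inf]
# after pass 3: [0.0, 1.0, 2.0, 3.0]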
I want to find a multithreaded algorithm to multiply an $n \times n$ matrix by an $n$-vector that achieves $\Theta(n^2/\lg n)$ parallelism while maintaining $\Theta(n^2)$ work.
I know an "illegal" solution, but any tips on how to make the span go down to $\Theta(\lg n)$?
There is an implementation of this problem in the CLRS textbook, in a procedure named MAT-VEC, but its span is $\Theta(n)$. To pull it down to logarithmic span, you can replace the serial summation in the inner for loop with a multithreaded divide-and-conquer strategy: recursively divide the range, spawn one half in parallel with the other, then sync and return the sum left + right.
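Here is a serial Python sketch of that structure (the spawn/sync comments mark where the CLRS parallel keywords would go; with them, each dot product has $\Theta(\lg n)$ span while the total work stays $\Theta(n^2)$):

def dot_dc(row, x, lo, hi):
    if lo == hi:
        return row[lo] * x[lo]
    mid = (lo + hi) // 2
    left = dot_dc(row, x, lo, mid)       # spawn: compute in parallel...
    right = dot_dc(row, x, mid + 1, hi)  # ...with this half
    # sync: wait for the spawned half before combining
    return left + right

def mat_vec(A, x):
    n = len(A)
    # in CLRS the loop over rows is a parallel-for; iterations are independent
    return [dot_dc(A[i], x, 0, n - 1) for i in range(n)]

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # -> [3, 7]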
'Simple' question: what is the fastest way to calculate the binomial coefficient? Some threaded algorithm?
I'm looking for hints :) - not implementations :)
Well, the fastest way, I reckon, would be to read them from a table rather than compute them.
Your requirement of integer accuracy from a double representation means that C(60,30) is just about too big, being around 1e17, so (assuming you want C(m,n) for all m up to some limit and all n <= m) your table would only have around 1800 entries. As for filling the table in, I think Pascal's triangle is the way to go.
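A minimal Python sketch of that table fill, assuming a limit of 60 (the names are mine; in a fixed-width language the entries near the limit need a 64-bit or big-integer type):

# fill the table with Pascal's rule: C(m, n) = C(m-1, n-1) + C(m-1, n);
# rows 0..60 give 1891 entries in total, and lookup is then O(1)
LIMIT = 60
table = [[1] * (m + 1) for m in range(LIMIT + 1)]
for m in range(2, LIMIT + 1):
    for n in range(1, m):
        table[m][n] = table[m - 1][n - 1] + table[m - 1][n]

print(table[60][30])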
According to the multiplicative formula from Wikipedia, C(n,k) = prod_{i=1..k} (n-k+i)/i, the fastest way would be to split the range i = 1..k among the threads, give each thread one range segment, and have each thread update the final result under a lock. The "academic way" would be to split the range into tasks, each task being to calculate (n - k + i)/i, and then, no matter how many threads you have, they all run in a loop asking for the next task. The first is faster, the second is... academic.
EDIT: further explanation - in both ways we have some arbitrary number of threads. Usually the number of threads equals the number of processor cores, because there is no benefit in adding more. The difference between the two ways is in what those threads are doing.
In the first way each thread is given N, K, I1 and I2, where [I1, I2] is its segment of the range 1..K. Each thread then has all the data it needs, so it calculates its part of the result and, upon finishing, updates the final result.
In the second way each thread is given N, K, and access to a synchronized counter that counts from 1 to K. Each thread then acquires one value from this shared counter, calculates one fraction of the result, updates the final result, and loops until the counter informs it that there are no more items. If some processor cores are faster than others, this second way will put all cores to maximum use. The downside of the second way is too much synchronization, which effectively blocks, say, 20% of the threads all the time.
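Here is a rough Python sketch of the first way (names and the worker count are mine). Per-segment numerators and denominators are multiplied separately, because the individual ratios (n-k+i)/i are not integers on their own; and in CPython the GIL means threads only illustrate the decomposition, not a real speedup:

import math
from concurrent.futures import ThreadPoolExecutor

def segment_product(n, k, i1, i2):
    # multiply this segment's numerators (n-k+i) and denominators (i)
    num = den = 1
    for i in range(i1, i2 + 1):
        num *= n - k + i
        den *= i
    return num, den

def binomial(n, k, workers=4):
    # contiguous segments [i1, i2] covering 1..k, one per worker
    bounds = [(s * k // workers + 1, (s + 1) * k // workers)
              for s in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: segment_product(n, k, *b), bounds)
    num = den = 1
    for pn, pd in parts:
        num *= pn
        den *= pd
    return num // den  # exact: den = k! divides num = n!/(n-k)!

print(binomial(60, 30), math.comb(60, 30))  # the two should match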
Hint: you want to do as few multiplications as possible. The formula is n! / (k! * (n-k)!). You should need fewer than 2m multiplications, where m is the minimum of k and n-k. If you want to work with (fairly) big numbers, you should use a special class for the number representation (Java has BigInteger, for instance).
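A sketch of that hint in Python, whose built-in integers play the BigInteger role; using the symmetry C(n,k) = C(n,n-k) keeps the loop to m = min(k, n-k) steps, each with one multiplication and one exact division:

def binom(n, k):
    m = min(k, n - k)
    result = 1
    for i in range(1, m + 1):
        # exact at every step: the running value is C(n - m + i, i)
        result = result * (n - m + i) // i
    return result

print(binom(60, 30))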
Here's a way that never overflows if the final result is representable natively on the machine, involves no multiplications/factorizations, is easily parallelized, and generalizes to BigInteger types:
First note that the binomial coefficients satisfy the following recurrence:
C(n, k) = C(n-1, k-1) + C(n-1, k)
This yields a straightforward recursion for computing the coefficient: the base cases are C(n, 0) and C(n, n), both of which are 1.
The individual results from the subcalls are integers, and if C(n, k) can be represented by an int, they can too; so overflow is not a concern.
Naively implemented, the recursion leads to repeated subcalls and exponential runtimes.
This can be fixed by caching intermediate results. There are O(n^2) subproblems, which can be combined in O(1) time each, yielding an O(n^2) complexity bound.
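A compact Python sketch of this cached recursion (functools.lru_cache does the caching; in a language with fixed-width ints the same scheme is overflow-safe for the reason given above):

from functools import lru_cache

@lru_cache(maxsize=None)
def binom(n, k):
    # base cases: C(n, 0) = C(n, n) = 1
    if k == 0 or k == n:
        return 1
    # Pascal's rule; memoization keeps the O(n^2) subproblem bound
    return binom(n - 1, k - 1) + binom(n - 1, k)

print(binom(60, 30))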
This answer calculates the binomial expansion with Python:

import math

def h(a, b, c):
    # prints the expansion of (a + b*x)**c term by term
    x = 0
    part = "="
    while x < (c + 1):
        nCr = math.comb(c, x)
        coeff = (a ** (c - x)) * (b ** x) * nCr
        part = part + '+' + str(coeff) + 'x^' + str(x)
        x = x + 1
    print(part)

h(2, 6, 4)