Torch, how to use Multiple GPU for different dataset


Assume that I have 4 different datasets and 4 GPUs, as below.
4 datasets:
dat0 = [np.array(...)], dat1 = [np.array(...)], dat2 = [np.array(...)], dat3 = [np.array(...)]
4 GPUs:
device = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
Assume all four datasets have already been converted to tensors and transferred to the 4 different GPUs.
Now, I have a function f from another module which can be used on the GPU.
How can I do the following at the same time?
Compute the 4 results:
ans0 = f(dat0) on device[0], ans1 = f(dat1) on device[1], ans2 = f(dat2) on device[2], ans3 = f(dat3) on device[3]
Then move all 4 results back to the CPU and calculate the sum:
ans = ans0 + ans1 + ans2 + ans3

Assuming you only need ans for inference, you can easily perform those operations, but you will need a copy of function f on each of the four GPUs.
Here is what I would try: duplicate f four times and send one copy to each GPU. Then compute the intermediate results, sending each one back to the CPU for the final operation:
import copy
# Assumes f is a torch.nn.Module; modules have no .clone(), so copy.deepcopy is used
fns = [copy.deepcopy(f).to(d) for d in device]
datasets = [dat0, dat1, dat2, dat3]
results = []
for fn, data in zip(fns, datasets):
    result = fn(data).detach().cpu()
    results.append(result)
ans = torch.stack(results).sum(dim=0)
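One caveat about the loop in this answer: calling .cpu() inside the loop waits for each GPU to finish before work is queued on the next one. CUDA operations are queued asynchronously, so a common pattern is to launch the work on every device first and copy the results back afterwards, which may let the GPUs overlap. The sketch below illustrates that two-phase pattern. The function f, the toy data, and the CPU fallback are placeholders chosen for illustration, not part of the original question or answer.

```python
import torch

def f(x):
    # Hypothetical stand-in for the real function from the other module
    return (x * 2.0).sum(dim=0)

# One device per GPU; fall back to the CPU so the sketch runs anywhere
n_gpu = torch.cuda.device_count()
devices = [torch.device(f"cuda:{i}") for i in range(n_gpu)] or [torch.device("cpu")]

# Toy datasets: dataset i is a 3x2 tensor filled with the value i
datasets = [torch.full((3, 2), float(i)) for i in range(4)]
placed = [d.to(devices[i % len(devices)]) for i, d in enumerate(datasets)]

with torch.no_grad():
    # Phase 1: queue the work on every device without waiting for results
    pending = [f(d) for d in placed]

# Phase 2: copy each result back to the CPU (this is where we wait)
results = [p.cpu() for p in pending]
ans = torch.stack(results).sum(dim=0)
print(ans)
```

If f is a module with parameters, each device would still need its own copy of it, as described in the answer.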

Related

is it possible to broadcast torch.MultiHeadAttention?

Suppose I have a tensor of shape B,groups,npoints,features, e.g. 1,4,16,64 would represent a single batch with 4 "groups" of 16 points, each with 64-dim features, and an attention module attn = torch.nn.MultiheadAttention(64, nheads).
I would like to share attn across all groups of points.
Reshaping the input to B,groups*npoints,features and then applying attn is not the same, as features from different groups will attend to each other.
I can do the following:
import torch
B = 3
groups = 4
npoints = 16
feat = 64
nhead = 8
# batch_first=True so that each group is a batch entry and attention runs over its npoints
attn = torch.nn.MultiheadAttention(feat, nhead, batch_first=True)
input = torch.rand((B, groups, npoints, feat))
results = []
for b in range(B):
    # input[b] has shape groups,npoints,feat
    out, weights = attn(input[b], input[b], input[b])
    results.append(out)
# B,groups,npoints,feat
final_features = torch.stack(results)
I was just wondering if there is a more efficient way of doing this.
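One possible approach, offered as a sketch rather than a confirmed answer: fold the groups dimension into the batch dimension instead of into the sequence dimension. Each group then becomes an independent batch entry, so points only attend to points in their own group. This assumes a PyTorch version in which nn.MultiheadAttention accepts batch_first=True; the variable names mirror the question, and the loop is kept only as a reference to compare against.

```python
import torch

torch.manual_seed(0)
B, groups, npoints, feat, nhead = 3, 4, 16, 64, 8

# batch_first=True: inputs are (batch, sequence, embedding)
attn = torch.nn.MultiheadAttention(feat, nhead, batch_first=True)
attn.eval()
x = torch.rand(B, groups, npoints, feat)

with torch.no_grad():
    # Merge B and groups into one batch axis: (B*groups, npoints, feat)
    flat = x.reshape(B * groups, npoints, feat)
    out, _ = attn(flat, flat, flat)
    final_features = out.reshape(B, groups, npoints, feat)

    # Reference: the per-batch loop from the question
    looped = torch.stack([attn(x[b], x[b], x[b])[0] for b in range(B)])

print(final_features.shape)
```

Both versions apply the same shared weights; the reshaped call simply does it in one pass.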

How to get near optimal parallel efficiency for this simple Julia code?

I have the following simple code:
function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end
function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end
function find_min(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    # minsofar = n
    Threads.@threads for i in 1:N
    # for i in 1:N
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < minsofar[Threads.threadid()]
                minsofar[Threads.threadid()] = dist
            end
        end
    end
    return minimum(minsofar)
    #return minsofar
end
function ave_min(n, N)
    ITER = 10
    strings = random_strings2(n, N)
    new_min = find_min(strings, n, N)
    avesofar = new_min
    # print("New min ", new_min, ". New ave ", avesofar, "\n")
    total = avesofar
    for i in 1:ITER-1
        strings = random_strings2(n, N)
        new_min = find_min(strings, n, N)
        avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
        print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
    end
    return avesofar
end
N = 2^16
n = 99
print("Overall average ", ave_min(n, N), "\n")
When I run it on an AMD 8350 in linux the CPU usage is around 430% (instead of close to 800%).
Is it possible to make the parallelisation work more efficiently?
Also, I noticed a new very impressive looking package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, is it possible to speed up the code this way too?
Can the code be vectorized using LoopVectorization.jl?
(I am completely new to Julia)

The parallelization of your code seems to be correct.
Most likely you are running it in Atom or another IDE. By default, Atom uses only half of the cores (more precisely, only the physical cores, not the logical ones).
E.g., running in Atom on my machine:
julia> Threads.nthreads()
4
What you need to do is explicitly set JULIA_NUM_THREADS.
Windows command line (still assuming 8 logical cores):
set JULIA_NUM_THREADS=8
Linux command line:
export JULIA_NUM_THREADS=8
After doing that, your code takes 100% on all my cores.
EDIT
After discussion: you can get the time down to around 20% of the single-threaded time by using Distributed instead of Threads, since this avoids memory sharing.
The code will look more or less like this:
using Distributed
addprocs(8)
@everywhere function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end
function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end
function find_min(strings, n, N)
    return @distributed (min) for i in 1:N-1
        minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
    end
end
### ... the rest of code remains unchanged

Karatsuba recursive code is not working correctly

I want to implement the Karatsuba multiplication algorithm in Python, but it is not working completely.
The code does not work for values of x or y greater than 999. For inputs below 1000, the program shows the correct result. It also shows correct results on the base cases.
#Karatsuba method of multiplication.
f = int(input()) #Inputs
e = int(input())
def prod(x,y):
    r = str(x)
    t = str(y)
    lx = len(r) #Calculation of Lengths
    ly = len(t)
    #Base Case
    if(lx == 1 or ly == 1):
        return x*y
    #Other Case
    else:
        o = lx//2
        p = ly//2
        a = x//(10*o) #Calculation of a,b,c and d.
        b = x-(a*10*o) #The Calculation is done by
        c = y//(10*p) #calculating the length of x and y
        d = y-(c*10*p) #and then dividing it by half.
        #Then we just remove the half of the digits of the no.
        return (10**o)*(10**p)*prod(a,c)+(10**o)*prod(a,d)+(10**p)*prod(b,c)+prod(b,d)
print(prod(f,e))
I think there are some bugs in the calculation of a,b,c and d.

a = x//(10**o)
b = x-(a*10**o)
c = y//(10**p)
d = y-(c*10**p)
You meant 10 to the power of o (10**o), but wrote 10 multiplied by o (10*o).
You should practice finding these kinds of bugs yourself. There are multiple ways to do that:
Do the algorithm manually on paper for specific inputs, then step through your code and see if it matches
Reduce the code down to sub-portions and see if their expected value matches the produced value. In your case, check for every call of prod() what the expected output would be and what it produced, to find minimal input values that produce erroneous results.
Step through the code with the debugger. Before every line, think about what the result should be and then see if the line produces that result.
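As a sketch of the second suggestion, here is the function with the corrected powers, checked against Python's built-in multiplication on fixed and random inputs. The test inputs and the mismatches list are illustrative additions. Note that this keeps the question's four recursive products; it does not apply the three-product trick usually associated with Karatsuba.

```python
import random

def prod(x, y):
    """Divide-and-conquer product of two non-negative integers."""
    lx, ly = len(str(x)), len(str(y))
    # Base case: one of the numbers has a single digit
    if lx == 1 or ly == 1:
        return x * y
    o, p = lx // 2, ly // 2
    # Split x = a*10**o + b and y = c*10**p + d
    a, b = divmod(x, 10 ** o)
    c, d = divmod(y, 10 ** p)
    return (10 ** (o + p)) * prod(a, c) + (10 ** o) * prod(a, d) \
        + (10 ** p) * prod(b, c) + prod(b, d)

random.seed(0)
cases = [(999, 999), (1000, 1000), (1234, 5678)]
cases += [(random.randrange(10 ** 12), random.randrange(10 ** 12)) for _ in range(200)]

# Collect every input pair where the result differs from the expected value
mismatches = [(x, y) for x, y in cases if prod(x, y) != x * y]
print("mismatches:", mismatches)
```

Running the same check against the original code (with 10*o and 10*p) would expose failing inputs, which helps narrow down the faulty lines.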

How to use np.where in another np.where (context: ray tracing)

The question is: how to use two np.where in the same statement, like this (oversimplified):
np.where((ndarr1==ndarr2),np.where((ndarr1+ndarr2==ndarr3),True,False),False)
The goal is to avoid computing the second condition when the first one is not met.
My first objective is to find the intersection of a ray with a triangle, if there is one. This problem can be solved by this algorithm (found on Stack Overflow):
def intersect_line_triangle(q1,q2,p1,p2,p3):
    def signed_tetra_volume(a,b,c,d):
        return np.sign(np.dot(np.cross(b-a,c-a),d-a)/6.0)
    s1 = signed_tetra_volume(q1,p1,p2,p3)
    s2 = signed_tetra_volume(q2,p1,p2,p3)
    if s1 != s2:
        s3 = signed_tetra_volume(q1,q2,p1,p2)
        s4 = signed_tetra_volume(q1,q2,p2,p3)
        s5 = signed_tetra_volume(q1,q2,p3,p1)
        if s3 == s4 and s4 == s5:
            n = np.cross(p2-p1,p3-p1)
            t = np.dot(p1-q1,n) / np.dot(q2-q1,n)
            return q1 + t * (q2-q1)
    return None
Here are two conditional statements:
s1!=s2
s3==s4 & s4==s5
Now since I have >20k triangles to check, I want to apply this function on all triangles at the same time.
First solution is:
s1 = vol(r0,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s2 = vol(r1,tri[:,0,:],tri[:,1,:],tri[:,2,:])
s3 = vol(r1,r2,tri[:,0,:],tri[:,1,:])
s4 = vol(r1,r2,tri[:,1,:],tri[:,2,:])
s5 = vol(r1,r2,tri[:,2,:],tri[:,0,:])
np.where((s1!=s2) & (s3+s4==s4+s5),intersect(),False)
where s1, s2, s3, s4, s5 are arrays containing the value s for each triangle. The problem is that I have to compute s3, s4, and s5 for all triangles.
Ideally, I would compute statement 2 (and s3, s4, s5) only where statement 1 is True, with something like this:
check = np.where((s1!=s2), np.where((compute(s3)==compute(s4)) & (compute(s4)==compute(s5)), compute(intersection), False), False)
(To simplify the explanation, I just wrote 'compute' instead of the whole computing process. Here, 'compute' is done only on the appropriate triangles.)
Of course this option doesn't work (and it computes s4 twice), but I'd gladly take recommendations on a similar process.
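One way to get the effect of a nested np.where is boolean-mask indexing: evaluate the cheap condition on every row, then evaluate the expensive condition only on the rows that passed. The sketch below uses made-up data and a hypothetical expensive_test function standing in for the s3, s4, s5 computation; the call counter exists only to show how many rows reach the second stage.

```python
import numpy as np

calls = {"rows": 0}

def expensive_test(rows):
    # Hypothetical stand-in for computing s3, s4, s5 and comparing them
    calls["rows"] += len(rows)
    return rows.sum(axis=1) > 0

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 3))
s1 = np.sign(data[:, 0])
s2 = np.sign(data[:, 1])

first = s1 != s2                      # cheap condition, evaluated on all rows
idx = np.flatnonzero(first)           # indices of rows that passed it
result = np.zeros(len(data), dtype=bool)
result[idx] = expensive_test(data[idx])   # expensive condition on the subset only

# Reference: both conditions evaluated on every row
reference = first & (data.sum(axis=1) > 0)
print(calls["rows"], "of", len(data), "rows reached the second condition")
```

Whether this is faster depends on how many rows survive the first condition, since the fancy indexing itself copies data.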

Here's how I used masked arrays to answer this problem:
loTrue= np.where((s1!=s2),False,True)
s3=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t0), r0t1)),mask=loTrue)
s4=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t1), r0t2)),mask=loTrue)
s5=ma.masked_array(np.sign(dot(np.cross(r0r1, r0t2), r0t0)),mask=loTrue)
loTrue= ma.masked_array(np.where((abs(s3-s4)<1e-4) & ( abs(s5-s4)<1e-4),True,False),mask=loTrue)
#also works when computing s3,s4 and s5 inside loTrue, like this:
loTrue= np.where((s1!=s2),False,True)
loTrue= ma.masked_array(np.where(
(abs(np.sign(dot(np.cross(r0r1, r0t0), r0t1))-np.sign(dot(np.cross(r0r1, r0t1), r0t2)))<1e-4) &
(abs(np.sign(dot(np.cross(r0r1, r0t2), r0t0))-np.sign(dot(np.cross(r0r1, r0t1), r0t2)))<1e-4),True,False)
,mask=loTrue)
Note that the same process, without masked arrays, is done like this:
s3= np.sign(dot(np.cross(r0r1, r0t0), r0t1) /6.0)
s4= np.sign(dot(np.cross(r0r1, r0t1), r0t2) /6.0)
s5= np.sign(dot(np.cross(r0r1, r0t2), r0t0) /6.0)
loTrue= np.where((s1!=s2) & (abs(s3-s4)<1e-4) & ( abs(s5-s4)<1e-4) ,True,False)
Both give the same results. However, when looping over this process for 10k iterations, NOT using masked arrays is faster: 26 s without masked arrays, 31 s with masked arrays, and 33 s when using masked arrays in one line only (not computing s3, s4 and s5 separately, or computing s4 beforehand).
Conclusion: nesting the conditions is solved here with masked arrays (note that the mask indicates where values won't be computed, hence the first loTrue must be set to False (0) where the condition is verified). However, in this scenario, it's not faster.

I can get a small speedup from short circuiting but I'm not convinced it is worth the additional admin.
full computation 4.463818839867599 ms per iteration (one ray, 20,000 triangles)
short circuiting 3.0060838296776637 ms per iteration (one ray, 20,000 triangles)
Code:
import numpy as np
def ilt_cut(q1,q2,p1,p2,p3):
    qm = (q1+q2)/2
    qd = qm-q2
    p12 = p1-p2
    aux = np.cross(qd,q2-p2)
    s3 = np.einsum("ij,ij->i",aux,p12)
    s4 = np.einsum("ij,ij->i",aux,p2-p3)
    ge = (s3>=0)&(s4>=0)
    le = (s3<=0)&(s4<=0)
    keep = np.flatnonzero(ge|le)
    aux = p1[keep]
    qpm1 = qm-aux
    p31 = p3[keep]-aux
    s5 = np.einsum("ij,ij->i",np.cross(qpm1,p31),qd)
    ge = ge[keep]&(s5>=0)
    le = le[keep]&(s5<=0)
    flt = np.flatnonzero(ge|le)
    keep = keep[flt]
    n = np.cross(p31[flt], p12[keep])
    s12 = np.einsum("ij,ij->i",n,qpm1[flt])
    flt = np.abs(s12) <= np.abs(s3[keep]+s4[keep]+s5[flt])
    return keep[flt],qm-(s12[flt]/np.einsum("ij,ij->i",qd,n[flt]))[:,None]*qd
def ilt_full(q1,q2,p1,p2,p3):
    qm = (q1+q2)/2
    qd = qm-q2
    p12 = p1-p2
    qpm1 = qm-p1
    p31 = p3-p1
    aux = np.cross(qd,q2-p2)
    s3 = np.einsum("ij,ij->i",aux,p12)
    s4 = np.einsum("ij,ij->i",aux,p2-p3)
    s5 = np.einsum("ij,ij->i",np.cross(qpm1,p31),qd)
    n = np.cross(p31, p12)
    s12 = np.einsum("ij,ij->i",n,qpm1)
    ge = (s3>=0)&(s4>=0)&(s5>=0)
    le = (s3<=0)&(s4<=0)&(s5<=0)
    keep = np.flatnonzero((np.abs(s12) <= np.abs(s3+s4+s5)) & (ge|le))
    return keep,qm-(s12[keep]/np.einsum("ij,ij->i",qd,n[keep]))[:,None]*qd
tri = np.random.uniform(1, 10, (20_000, 3, 3))
p0, p1 = np.random.uniform(1, 10, (2, 3))
from timeit import timeit
A,B,C = tri.transpose(1,0,2)
print('full computation', timeit(lambda: ilt_full(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
print('short circuiting', timeit(lambda: ilt_cut(p0[None], p1[None], A, B, C), number=100)*10, 'ms per iteration (one ray, 20,000 triangles)')
Note that I played a bit with the algorithm, so this may not give the same result as yours in every edge case.
What I changed:
I inlined the tetra volume, which saves a few repeated subcomputations.
I replaced one of the ray ends with the midpoint M of the ray. This saves computing one tetra volume (s1 or s2), because one can check whether the ray crosses the plane of triangle ABC by comparing the volume of tetra ABCM to the sum of s3, s4, s5 (if they have the same signs).

How to implement simple parallel computation on different cores using MPI in Python

I want to implement a simple computation task in parallel. Let's say I have two arrays with 2 components each, and I want to sum the components of these arrays pairwise and store the sums in a new array. There are 4 combinations of the components (2x2). A simple serial version uses only 1 core, and the summing operation is executed 4 times on that core. Here is the code:
a = [1 , 5]
b = [10 , 20]
d = []
for i in range(2):
    for j in range(2):
        c = a[i] + b[j]
        d.append(c)
print (d)
Now I want to use MPI to run the above code in parallel on 4 different cores of my PC to make it faster. That is, I want each combination to be computed on its assigned core (e.g. 4 summing operations on 4 different cores). Here is how I can import MPI:
from mpi4py import MPI
mpi_comm = MPI.COMM_WORLD
rank_process = mpi_comm.rank
I have never used parallel computation, so it looks a bit confusing to me. I was wondering if someone could help me with this. Thanks in advance for your time.

You can use Create_cart to assign MPI processes to parts of a matrix, so they can be given indices i and j as in your serial example. Here is the solution:
from mpi4py import MPI
mpi_comm = MPI.COMM_WORLD
rank = mpi_comm.rank
root = 0
#Define data
a = [1 , 5]
b = [10 , 20]
d = []
#Print serial solution
if rank == 0:
    for i in range(2):
        for j in range(2):
            c = a[i] + b[j]
            d.append(c)
    print("Serial soln = ", d)
#Split domain into 2 by 2 comm and run an element on each process
cart_comm = mpi_comm.Create_cart([2, 2])
i, j = cart_comm.Get_coords(rank)
d = a[i] + b[j]
#Print solns on each process, note this will be jumbled
# as all print as soon as they get here
print("Parallel soln = ", d)
# Better to gather and print on root
ds = mpi_comm.gather(d, root=root)
if rank == root:
    print("Parallel soln gathered = ", ds)
where you get something like,
('Serial soln = ', [11, 21, 15, 25])
('Parallel ('Parallel soln = '('Parallel soln = 'soln = ', 11)
, 21)
('Parallel soln = ', 15)
, 25)
('Parallel soln gathered = ', [11, 21, 15, 25])
where the parallel output is jumbled. Note you need to run with mpiexec as follows,
mpiexec -n 4 python script_name.py
where script_name.py is your script name.
I'm not sure this is a very good example of how you'd use MPI. It is worth reading up on MPI in general and looking at some of the canonical examples. In particular, note that as each process is independent, with its own data, you should be working on problems that can be split up into parts. In your example, the lists a and b are defined in full on every process, and each process only uses a small part of them. The only difference between the processes is the rank (0 to 3) and, later, the 2D Cartesian indices from Create_cart, which determine which part of the "global" array each one uses.
A better solution, closer to how you'd use this in practice, may be to scatter parts of a large matrix to many processes, do some work using that part of the matrix, and gather back the solution to get the complete matrix (again, see examples which cover this sort of thing).
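To make the role of the rank concrete, the following sketch reproduces, without MPI, the rank-to-(i, j) mapping that the 2 by 2 Cartesian communicator provides in the example. It assumes the row-major rank ordering that MPI uses for Cartesian topologies and that ranks are not reordered; coords_of is a hypothetical helper written for illustration and is not an mpi4py function.

```python
a = [1, 5]
b = [10, 20]
dims = (2, 2)

def coords_of(rank, dims):
    # Row-major numbering: the last coordinate varies fastest
    return divmod(rank, dims[1])

# What each of the 4 ranks would compute on its own
per_rank = {}
for rank in range(dims[0] * dims[1]):
    i, j = coords_of(rank, dims)
    per_rank[rank] = a[i] + b[j]

# gather on the root returns the values ordered by rank
gathered = [per_rank[r] for r in sorted(per_rank)]
print(gathered)
```

In the real program each rank executes only its own iteration of this loop, and gather collects the values in rank order on the root.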
