Odd behavior of matrix multiplication with threading in Julia - multithreading

I am trying to do some linear algebra inside a Threads.@threads for loop, and it returns some weird results. It seems that the matrices are not multiplied properly in the loop. Is it safe to do this in a threaded for loop?
Below is a minimal working example that generates a TxR table of NxN matrices. For each of the R independent series, the (t+1)-th matrix is the product of the (t)-th one with another random matrix. The multiplications are performed by different threads and then checked for correctness by a single thread. The function should return a matrix of zeros. It does so for N<=3; however, there are a few ones in the result for N>=4.
function testMM(T, R, N)
    m1 = zeros(Int64, (T, R, N, N))
    m2 = rand(0:2, (T-1, R, N, N))
    m1[1,:,:,:] = rand(0:1, (R, N, N))
    Threads.@threads for i = 1:R
        for t = 2:T
            m1[t,i,:,:] = m2[t-1,i,:,:] * m1[t-1,i,:,:]
        end
    end
    odds = zeros(Int64, (T-1, R))
    for i = 1:R
        for t = 2:T
            if m1[t,i,:,:] != m2[t-1,i,:,:] * m1[t-1,i,:,:]
                odds[t-1,i] = 1
            end
        end
    end
    return odds
end
Threads.nthreads() is 4 for me. Tested on stable 64-bit Julia 0.5.2, 0.5.3, and 0.6.0 on Windows.
Edit: This example is even simpler: several copies of a matrix are squared several times, independently of each other. The result should be the same for all copies, but the function usually returns false for N>=4. It looks as if the data from different threads gets mixed up somewhere inside BLAS.
function testMM2(T, R, N)
    m0 = rand(0:2, (N, N))
    m = [deepcopy(m0) for i = 1:R]
    Threads.@threads for i = 1:R
        for t = 1:T
            m[i] = m[i]^2
        end
    end
    return all(x -> (x == m[1]), m)
end
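One way to narrow this down, as a sketch rather than a confirmed explanation: repeat the squaring with a hand-written multiplication that only touches loop-local storage. The helpers matmul_local and testMM2_local below are hypothetical names, not part of the question. If this variant is always consistent while testMM2 is not, the mix-up happens inside the library multiplication rather than in the @threads loop itself.
# Hedged diagnostic sketch: hypothetical helpers, not from the question above.
function matmul_local(A, B)
    # plain triple-loop multiply into a freshly allocated, loop-local result
    n = size(A, 1)
    C = zeros(eltype(A), n, n)
    for j = 1:n, k = 1:n, i = 1:n
        C[i, j] += A[i, k] * B[k, j]
    end
    return C
end

function testMM2_local(T, R, N)
    m0 = rand(0:2, (N, N))
    m = [copy(m0) for i = 1:R]
    Threads.@threads for i = 1:R
        for t = 1:T
            m[i] = matmul_local(m[i], m[i])  # same as m[i]^2, but without the library matmul
        end
    end
    # If this is always true while testMM2 returns false, the problem is in
    # the library multiplication, not in the threaded loop bookkeeping.
    return all(x -> x == m[1], m)
end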

Related

Why am I observing no speedup in a parallelised for loop in Julia

There was an error in my code, sorry everybody. The reason was that the time needed for larger values of x is larger.
This is my code:
fit_s = rand(100)
fit_d = rand(300)
LIM = 100
x = [2^i for i in range(-2, 2, 40)]
y = zeros(length(x))
Threads.@threads for i in 1:length(x)
    y[i] = analytic(x[i], fit_s, fit_d, LIM)
end
print(i)
my_function (the function analytic in the code) uses the two inputs to produce the output.
The problem is that it does not speed up the execution: with 1 thread it takes 27 s, while with 8 threads it takes 21 s (I am only timing the for loop).
What can I do?
Thanks in advance.
The functions:
using Distributions  # Poisson and pdf come from Distributions

function alpha(i, n, m)
    ((n-i)/n)^m - ((n-i-1)/n)^m
end

function beta(i, n, m)
    (1 - i/n)^m
end

function gamma(s, d, fit_s, fit_d, LIM)
    n_s = length(fit_s)
    n_d = length(fit_d)
    app = sort(fit_d, rev=true)[1:LIM]
    F_DN = 0.
    for i in 1:LIM
        j = sum(fit_s .> app[i])
        F_DN += alpha(i-1, n_d, d) * beta(j, n_s, s)
    end
    F_DN
end

function mine_poisson(m, i)
    f = Poisson(m)
    pdf(f, i)
end

function analytic(m, fit_s, fit_d, LIM)
    Z = 1 - exp(-2*m)
    F_DN = 0.
    intm = trunc(Int, m + 0.5)
    if m < 20
        intervallo = 1:(intm + 20)
    else
        app = trunc(Int, m^0.5) * 4
        intervallo = (intm - app):(intm + app)
    end
    for j in intervallo
        for k in intervallo
            p = mine_poisson(m, k) * mine_poisson(m, j) / Z
            F_DN += gamma(j, k, fit_s, fit_d, LIM) * p
        end
    end
    F_DN += exp(-m) * (1 - exp(-m)) / Z
end
Your code looks correct.
There are several reasons why you might not observe a speedup:
You forgot the -t parameter when starting Julia. Use Threads.nthreads() to check the actual number of threads.
Your operating system is heavily loaded with other tasks.
[Most likely] Your function my_function is using parallelized code via BLAS. To find out the BLAS configuration, try:
using LinearAlgebra
LinearAlgebra.BLAS.get_num_threads()
BLAS is used for linear algebra operations such as matrix multiplication. In that case the call to my_function would already be using most of the threads, and hence only a small speedup is observed (a minimal sketch of these checks is given right after this list).
Some other libraries, such as DataFrames, use threading for selected operations.
Other issues such as false sharing or a race condition (which does not appear to be the case in your code).
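A minimal sketch of those checks, assuming a reasonably recent Julia (1.6 or newer, where BLAS.get_num_threads is available). Pinning BLAS to a single thread at the end is a common mitigation for oversubscription, not something stated in the answer above:
using LinearAlgebra

# Start Julia with the desired number of Julia-level threads, e.g.
#   julia -t 8        (or set the JULIA_NUM_THREADS environment variable)

# How many Julia threads does the @threads loop actually get?
println("Julia threads: ", Threads.nthreads())

# How many threads is the BLAS library using internally?
println("BLAS threads:  ", BLAS.get_num_threads())

# Optional: give all cores to the @threads loop and keep BLAS from
# competing with it for the same cores.
BLAS.set_num_threads(1)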

Multiply every element of a matrix with a vector to obtain a matrix whose elements are vectors themselves

I need help in speeding up the following block of code:
import numpy as np
x = 100
pp = np.zeros((x, x))
M = np.ones((x, x))
arrayA = np.random.uniform(0, 5, 2000)
arrayB = np.random.uniform(0, 5, 2000)
for i in range(x):
    for j in range(x):
        y = np.multiply(arrayA, np.exp(-1j*(M[j,i])*arrayB))
        p = np.trapz(y, arrayB)  # numerical evaluation/integration of y
        pp[j,i] = abs(p**2)
Is there a function in numpy, or another method, with which I could rewrite this piece of code so that the nested for-loops can be omitted? My idea would be a function that multiplies every element of M with the vector arrayB, so that we get a 100 x 100 matrix in which each element is a vector itself. Then each of those vectors gets multiplied by arrayA with the np.multiply() function, again giving a 100 x 100 matrix in which each element is a vector. Finally, numerical integration with np.trapz() is performed for each of those vectors, yielding a 100 x 100 matrix in which each element is a scalar.
My problem, though, is that I don't know which functions could perform this.
Thanks in advance for your help!
Edit:
Using broadcasting with
M = np.asarray(M)[..., None]
y = 1000*arrayA*np.exp(-1j*M*arrayB)
return np.trapz(y,B)
works, and I can omit the for-loops. However, this is not faster, but instead a little bit slower in my case. This might be a memory issue.
y = np.multiply(arrayA, np.exp(-1j*(M[j,i])*arrayB))
can be written as
y = arrayA * np.exp(-1j*M[:,:,None]*arrayB)
producing an (x, x, 2000) array.
But the next step may need adjustment. I'm not familiar with np.trapz.
np.trapz(y, arrayB)

Parallelizing an initial-boundary value problem (finite difference)

I am running a simulation to solve the advection-diffusion equation. I wish to parallelize the part of the code where I calculate the partial derivatives, so as to speed up my computation. Here is what I am doing:
import numpy as np
from multiprocessing import Pool

p1 = np.zeros((len(r), len(th)-1))  # the solution matrix (r and th are defined elsewhere)
def func(i):
    pti = np.zeros(len(th)-1)
    for j in range(len(pti)):
        abc = f(p1)  # some function calculating the derivatives at each point
        pti[j] = p1[i][j] + dt*(abc)  # dt is some small float number
    return pti
# Setting the initial condition of the p1 matrix
for i in range(len(p1[:,0])):
    for j in range(len(p1[0])):
        p1[i][j] = 0.01
# Final loop calculating the integral by a finite difference scheme
p = Pool(args.cores)  # args comes from argparse (not shown)
for k in range(0, args.iterations):  # this is the integration in time
    p1 = p.map(func, range(len(r)))
print(p1)
The problem that I am facing here is that my p1 matrix is not updated after each iteration in k. In the end, when I print p1, I get the same matrix that I initialized.
Also, the serial version of this code works (but it takes too long).
Okay, I solved this myself. Apparently, putting the line
p = Pool(args.cores)
inside the loop
for k in range (0,args.iterations):
does the trick.

Element-wise variance of an iterator

What's a numerically-stable way of taking the variance of an iterator elementwise? As an example, I would like to do something like
var((rand(4,2) for i in 1:10))
and get back a (4,2) matrix which is the variance in each coefficient. This throws an error using Julia's Base var. Is there a package that can handle this? Or an easy (and storage-efficient) way to do this using the Base Julia function? Or does one need to be developed on its own?
I went ahead and implemented a Welford algorithm to calculate this:
# Welford algorithm
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
function componentwise_meanvar(A; bessel=true)
    x0 = first(A)
    n = 0
    mean = zero(x0)
    M2 = zero(x0)
    delta = zero(x0)
    delta2 = zero(x0)
    for x in A
        n += 1
        delta .= x .- mean
        mean .+= delta./n
        delta2 .= x .- mean
        M2 .+= delta.*delta2
    end
    if n < 2
        return NaN
    else
        if bessel
            M2 .= M2 ./ (n .- 1)
        else
            M2 .= M2 ./ n
        end
        return mean, M2
    end
end
A few other algorithms are implemented in DiffEqMonteCarlo.jl as well. I'm surprised I couldn't find a library for this, but maybe I will refactor this out someday.
See the update below for a numerically stable version.
Another method to calculate this:
srand(0) # reset random for comparing across implementations
moment2var(t) = (t[3]-t[2].^2./t[1])./(t[1]-1)
foldfunc(x,y) = (x[1]+1,x[2].+y,x[3].+y.^2)
moment2var(foldl(foldfunc,(0,zeros(1,1),zeros(1,1)),(rand(4,2) for i=1:10)))
Gives:
4×2 Array{Float64,2}:
0.0848123 0.0643537
0.0715945 0.0900416
0.111934 0.084314
0.0819135 0.0632765
Similar to:
srand(0) # reset random for comparing across implementations
# naive component-wise application of `var` function
map(var,zip((rand(4,2) for i=1:10)...))
which is the non-iterator version (or offline version in CS terminology).
This method is based on calculating the variance from the mean and the sum of squares. moment2var and foldfunc are just helper functions; without them the whole thing fits in one line.
Comments:
Speedwise, this should be pretty good as well. Perhaps StaticArrays and initializing the foldl's v0 with the correct eltype of the iterator would save even more time.
Benchmarking gave 5x speed advantage (and better memory usage) over componentwise_meanvar (from another answer) on a sample input.
Using moment2meanvar(t) = (t[2]./t[1], (t[3]-t[2].^2./t[1])./(t[1]-1)) gives both mean and variance, like componentwise_meanvar.
As @ChrisRackauckas noted, this method suffers from numerical instability when the number of elements to sum is large.
--- UPDATE with variant of method ---
A slight abstraction of the question asks for a way to do a foldl (and reduce, foldr) on an iterator returning matrices, element-wise and retaining shape. To do so, we can define a helper function mfold which takes a folding function and makes it fold matrices element-wise. Define it as follows:
mfold(f) = (x,y)->[f(t[1],t[2]) for t in zip(x,y)]
For this specific problem of variance, we can define the component-wise fold functions, and a final function to combine the moments into the variance (and mean if wanted). The code:
ff(x,y) = (x[1]+1,x[2]+y,x[3]+y^2) # fold and collect moments
moment2var(t) = (t[3]-t[2]^2/t[1])/(t[1]-1) # calc variance from moments
moment2meanvar(t) = (t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1))
We can see moment2meanvar works on a single vector as follows:
julia> moment2meanvar(foldl(ff,(0.0,0.0,0.0),[1.0,2.0,3.0]))
(2.0, 1.0)
Now to matrix-ize it using mfold (with dot-broadcast notation):
moment2var.(foldl(mfold(ff),fill((0,0,0),(4,2)),(rand(4,2) for i=1:10)))
@ChrisRackauckas noted this is not numerically stable, and another method (detailed in the Wikipedia article) is better. Using mfold, this could be implemented as:
# better fold function, compensating the sums for stability
ff2(x, y) = begin
    delta = y - x[2]
    mean = x[2] + delta/(x[1] + 1)
    return (x[1] + 1, mean, x[3] + delta*(y - mean))
end
# combine the collected information for the variance (and mean)
m2var(t) = t[3]/(t[1]-1)
m2meanvar(t) = (t[2],t[3]/(t[1]-1))
Again we have:
m2var.(foldl(mfold(ff2),fill((0,0.0,0.0),(4,2)),(rand(4,2) for i=1:10)))
Giving the same results (perhaps a little more accurately).
Or an easy (and storage-efficient) way to do this using the Base Julia function?
Out of curiosity, why is the standard solution of using var along the external dimension not good for you?
julia> var(cat(3,(rand(4,2) for i in 1:10)...),3)
4×2×1 Array{Float64,3}:
[:, :, 1] =
0.08847 0.104799
0.0946243 0.0879721
0.105404 0.0617594
0.0762611 0.091195
Obviously, I'm using cat here, which clearly is not very storage efficient, just so I can use the Base Julia function and your original generator syntax as per your question. But you could make this storage efficient as well, if you initialise your random values directly on a preallocated array of size (4,2,10), so that's not really an issue here.
Or did I misunderstand your question?
EDIT - benchmark in response to comments
function standard_var(Y, A)
    for i in 1:length(A)
        Y[:,:,i], = next(A, i)
    end
    var(Y, 3)
end
function testit()
    A = (rand(4,2) for i in 1:10000)
    Y = Array{Float64, 3}(4, 2, length(A))
    @time componentwise_meanvar(A)  # as defined in Chris's answer above
    @time standard_var(Y, A)        # standard variance + using preallocation
    @time var(cat(3, A...), 3)      # standard variance without preallocation
    return nothing
end
julia> testit()
0.004258 seconds (10.01 k allocations: 1.374 MiB)
0.006368 seconds (49.51 k allocations: 2.129 MiB)
5.954470 seconds (50.19 M allocations: 2.989 GiB, 71.32% gc time)

Parallelization of Piecewise Polynomial Evaluation

I am trying to evaluate points of a large piecewise polynomial, which is obtained from a cubic spline. This takes a long time and I would like to speed it up.
As such, I would like to evaluate the points of the piecewise polynomial with parallel processes rather than sequentially.
Code:
z = zeros(1e6, 1) ; % preallocate some memory for speed
Y = rand(11220,161) ; %some data, rand for generating a working example
X = 0 : 0.0125 : 2 ; % vector of data sites
pp = spline(X, Y) ; % get the piecewise polynomial form of the cubic spline.
The resulting structure is large.
for t = 1 : 1e6 % big number
    hcurrent = ppval(pp,t); % evaluate the piecewise polynomial at t
    z(t) = sum(x(t:t+M-1).*hcurrent,1); % do some operation on the interpolated value. Most likely not relevant to this question.
end
Unfortunately, using the matrix form
hcurrent = flipud(ppval(pp, 1:1e6))
requires too much memory, so it cannot be done. Is there a way that I can batch-process this code to speed it up?
For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code that takes advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, moves some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(@power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
    hcurrent = sum(bsxfun(@times,c(idx(t)+d,:),x(t,:)),2);
    z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optimized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.
Use the parfor command for parallel loops (see here). Also precompute the z vector as z(j) = x(j:j+M-1) and compute hcurrent inside the parfor for a speed-up.
The spline parameter estimation can be written in matrix form. Once you write it in matrix form and solve it, you can use the model matrix to evaluate the spline at all data points using matrix multiplication, which is probably the most highly tuned operation in MATLAB.
