Does Julia need vectorization to speed up computation?

In Python, it is usually suggested to vectorize the code to make computation faster. For example, if you want to compute the inner product of two vectors, say a and b, usually
Code A
import numpy as np
c = np.dot(a, b)
is better than
Code B
c = 0
for i in range(len(a)):
    c += a[i] * b[i]
But in Julia, it seems vectorization is sometimes not that helpful. I treated '* and dot as vectorized versions and an explicit for loop as the non-vectorized version, and got the following result.
using Random
using LinearAlgebra
len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)
function inner_product(a, b)
    result = 0
    for i in 1:length(a)
        result += a[i] * b[i]
    end
    return result
end
@time a' * b
@time dot(a, b)
@time inner_product(a, b);
0.013442 seconds (56.76 k allocations: 3.167 MiB)
0.003484 seconds (106 allocations: 6.688 KiB)
0.008489 seconds (17.52 k allocations: 976.752 KiB)
(I know using BenchmarkTools.jl is a more standard way to measure the performance.)
From the output, dot runs faster than the for loop, which in turn runs faster than '*; this contradicts what I had presumed.
So my question is,
does Julia need (or sometimes need) vectorization to speed up computation?
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?

You are not running the benchmarks correctly, and the implementation of your function is suboptimal.
julia> using BenchmarkTools
julia> @btime $a' * $b
429.400 μs (0 allocations: 0 bytes)
249985.3680190253
julia> @btime dot($a,$b)
426.299 μs (0 allocations: 0 bytes)
249985.3680190253
julia> @btime inner_product($a, $b)
970.500 μs (0 allocations: 0 bytes)
249985.36801903677
The correct implementation:
function inner_product_correct(a, b)
    result = 0.0 # use the same type as the elements in the args
    @simd for i in 1:length(a)
        @inbounds result += a[i] * b[i]
    end
    return result
end
julia> @btime inner_product_correct($a, $b)
530.499 μs (0 allocations: 0 bytes)
249985.36801902478
There is still a difference (although less significant) because dot uses the optimized BLAS implementation, which is multi-threaded. You could (following Bogumil's comment) set OPENBLAS_NUM_THREADS=1, and then you will find that the timing of the BLAS function is identical to the Julia implementation.
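For reference, here is a minimal sketch of controlling this from within a running session (the environment variable has to be set before Julia starts, whereas the call below works at runtime):
using LinearAlgebra
BLAS.set_num_threads(1)  # restrict BLAS/LAPACK to a single thread
@btime dot($a, $b)       # should now be close to the inner_product_correct timing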
Note also that working with floating-point numbers is tricky in many ways:
julia> inner_product_correct(a, b)==dot(a,b)
false
julia> inner_product_correct(a, b) ≈ dot(a,b)
true
Finally, in Julia deciding whether to use vectorization or write the loop yourself is up to you - there is no performance penalty (as long as you write type-stable code and use @simd and @inbounds where required). However, in your code you were not testing vectorization; you were comparing calling BLAS to writing the loop yourself. Here is the must-read to understand what is going on: https://docs.julialang.org/en/v1/manual/performance-tips/

Let me add my practical experience as a comment (too long for a standard comment):
does Julia need (or sometimes need) vectorization to speed up computation?
Julia does not need vectorization the way Python does (see the answer by Przemysław), but in practice, if you have a well-written vectorized function (like dot), then use it: while possible, it can be tricky to write an equally performant function yourself (people have probably spent days optimizing dot, especially to make optimal use of multiple threads).
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
When you use vectorized code, it all depends on the implementation of the function you want to use. In this case dot(a, b) and a' * b are exactly the same, as @edit a' * b shows:
*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)
and you see it is the same.
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
Julia is a compiled language, while Python is an interpreted language. In some cases the Python interpreter can provide fast execution speed, but in other cases it currently cannot (which does not mean that it will not improve in the future). In particular, vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases, as they both just call this compiled function. However, when you use loops (non-vectorized code), Python will currently be slower than Julia.
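One way to see this concretely on the Julia side (a hedged sketch; the exact method chain can vary across Julia versions) is that dot on Float64 vectors ultimately forwards to a compiled BLAS routine:
using LinearAlgebra
@which dot(a, b)              # for Vector{Float64} this points at a method that forwards to BLAS
LinearAlgebra.BLAS.dot(a, b)  # the underlying compiled routine; same result as dot(a, b)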

Related

Julia 1.5.2 Performance Questions

I am currently attempting to implement a metaheuristic (genetic) algorithm. In this venture I also want to try to create somewhat fast and efficient code. However, my experience in writing efficient code is not very great. I was therefore wondering if some people could give some "quick tips" to increase the efficiency of my code. I have created a small functional example of my code which contains most of the elements that the full code will contain with regards to preallocating arrays, custom mutable structs, random numbers, pushing into arrays etc.
The options that I have already attempted to explore are those relating to the package "StaticArrays". However, many of my arrays must be mutable (so we need MArrays) and many of them will become very large (> 100). The documentation of StaticArrays specifies that the static arrays must remain small to remain efficient.
According to the documentation, Julia 1.5.2 should be thread-safe with regards to rand(). I have therefore attempted to multithread the for-loops in my functions to make them run faster, and this results in a slight performance increase.
However, if people can see a more efficient way of allocating arrays or pushing SpotPrices into an array, it would be greatly appreciated! Any other performance tips are also very welcome!
# Packages
clearconsole()
using DataFrames
using Random
using BenchmarkTools
Random.seed!(42)
df = DataFrame( SpotPrice = convert(Array{Float64}, rand(-266:500,8832)),
month = repeat([1,2,3,4,5,6,7,8,9,10,11,12]; outer = 736),
hour = repeat([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]; outer = 368))
# Data structure for the prices per hour
mutable struct SpotPrices
    hour :: Array{Float64,1}
end
# Fill-out data structure
function setup_prices(df::DataFrame)
    prices = []
    for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end
prices = setup_prices(df)
# Sampler function
function MC_Sampler(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end
samples = MC_Sampler(prices, 100)
@btime setup_prices(df)
@btime MC_Sampler(prices, 100)
function setup_prices_par(df::DataFrame)
    prices = []
    @sync Threads.@threads for i in 1:length(unique(df[:,3]))
        push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
    end
    return prices
end
# Sampler function
function MC_Sampler_par(prices::Vector{Any}, sample_size::Int64)
    # Picking the samples
    tmp = zeros(sample_size, 24)
    # Sampling per hour
    @sync Threads.@threads for i in 1:24
        tmp[:,i] = rand(prices[i].hour, sample_size)
    end
    return tmp
end
@btime setup_prices_par(df)
@btime MC_Sampler_par(prices, 100)
Have a look at, and read very carefully, https://docs.julialang.org/en/v1/manual/performance-tips/
Basic cleanups start with the following (a sketch applying these tips is shown after the list):
Your SpotPrices struct does not need to be mutable. In any case, since there is only one field you could just define it as SpotPrices = Vector{Float64}
You do not want untyped containers - instead of prices = [] do prices = Float64[]
Using DataFrames.groupby will be much faster than finding unique elements and filtering by them
If you do not need to initialize, then do not do it - Vector{Float64}(undef, sample_size) is much faster than zeros(sample_size, 24)
You do not need to synchronize with @sync before a multi-threaded loop
Create random states - one separate one for each thread - and use them whenever calling the rand function
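Here is a minimal sketch applying those tips (the names setup_prices2 and mc_sampler2 are illustrative, not from the question, and the df defined in the question is assumed):
using DataFrames, Random

function setup_prices2(df::DataFrame)
    prices = Vector{Vector{Float64}}()          # typed container instead of []
    for g in groupby(df, :hour)                 # groupby instead of unique + filter
        push!(prices, Vector{Float64}(g.SpotPrice))
    end
    return prices
end

function mc_sampler2(prices::Vector{Vector{Float64}}, sample_size::Int)
    tmp = Matrix{Float64}(undef, sample_size, length(prices))          # no need to zero-initialize
    rngs = [MersenneTwister(1234 + i) for i in 1:Threads.nthreads()]   # one RNG per thread
    Threads.@threads for i in eachindex(prices)                        # no @sync needed
        tmp[:, i] = rand(rngs[Threads.threadid()], prices[i], sample_size)
    end
    return tmp
end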

Multi-threaded parallelism performance problem with Fibonacci sequence in Julia (1.3)

I'm trying the multithreading functionality of Julia 1.3 with the following hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end
@time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However when running the following code copied from the Julia page about multithreading
import Base.Threads.#spawn
function fib(n::Int)
if n < 2
return n
end
t = #spawn fib(n - 2)
return fib(n - 1) + fetch(t)
end
fib(43)
what happens is that the utilisation of RAM/CPU jumps from 3.2GB/6% to 15GB/25% without any output (for at least 1 minute, after which I decided to kill the Julia session)
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single-threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that they allow spawning of many, many tasks in different functions, with the scheduler figuring out an optimal workload.
The problem is that @spawn has a non-trivial overhead of around 1µs, so if you spawn a thread to do a task that takes less than 1µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something of order 1.6180^43 threads. If each one takes 1µs to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and re-merge / sync threads, which takes even more time.
Things like this, where you spawn a thread for each step of a computation, only make sense if each step of the computation takes a long time compared to the @spawn overhead.
Note that there is work going into lessening the overhead of @spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
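As a rough, hedged illustration of that overhead (numbers are machine-dependent; this snippet is not from the original answer), you can time a single spawn against the raw operation:
using BenchmarkTools, Base.Threads
@btime fetch(@spawn 1 + 1)  # on the order of microseconds: dominated by task creation and scheduling
@btime 1 + 1                # on the order of nanoseconds: the actual work is essentially free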
If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get
function F(n)
    if n < 2
        return n
    else
        return F(n-1)+F(n-2)
    end
end

julia> @btime F(23);
122.920 μs (0 allocations: 0 bytes)
so that's a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:
function fib(n::Int)
    if n < 2
        return n
    elseif n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return fib(n-1) + fib(n-2)
    end
end
Now, if I follow proper benchmark methodology with BenchmarkTools.jl [2], I find
julia> using BenchmarkTools
julia> @btime fib(43)
971.842 ms (1496518 allocations: 33.64 MiB)
433494437
julia> @btime F(43)
1.866 s (0 allocations: 0 bytes)
433494437
@Anush asks in the comments: this is a factor-of-2 speedup using 16 cores, it seems; is it possible to get something closer to a factor-of-16 speedup?
Yes it is. The problem with the above function is that the function body is larger than that of F, with lots of conditionals, function/thread spawning, and all that. I invite you to compare @code_llvm F(10) and @code_llvm fib(10). This means that fib is much harder for Julia to optimize. This extra overhead makes a world of difference for the small-n cases.
julia> @btime F(20);
28.844 μs (0 allocations: 0 bytes)
julia> @btime fib(20);
242.208 μs (20 allocations: 320 bytes)
Oh no! All that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n < 23, don't recurse down to fib; instead, call the single-threaded F.
function fib(n::Int)
    if n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return F(n)
    end
end

julia> @btime fib(43)
138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The @btime macro from BenchmarkTools.jl runs the function multiple times, skips compilation time, and averages the results.
@Anush
As an example of using memoization and multithreading manually:
_fib(::Val{1}, _, _) = 1
_fib(::Val{2}, _, _) = 1
import Base.Threads.@spawn
_fib(x::Val{n}, d = zeros(Int, n), channel = Channel{Bool}(1)) where n = begin
    # lock the channel
    put!(channel, true)
    if d[n] != 0
        res = d[n]
        take!(channel)
    else
        take!(channel) # unlock channel so I can compute stuff
        # t = @spawn _fib(Val(n-2), d, channel)
        t1 = _fib(Val(n-2), d, channel)
        t2 = _fib(Val(n-1), d, channel)
        res = fetch(t1) + fetch(t2)
        put!(channel, true) # lock channel
        d[n] = res
        take!(channel) # unlock channel
    end
    return res
end
fib(n) = _fib(Val(n), zeros(Int, n), Channel{Bool}(1))
fib(1)
fib(2)
fib(3)
fib(4)
@time fib(43)
using BenchmarkTools
@benchmark fib(43)
But the speed-up came from memoization and not so much from multithreading. The lesson here is that we should think about better algorithms before reaching for multithreading.
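To illustrate that point (a hedged sketch with the hypothetical name fib_memo, not taken from the answer above), memoization alone with no threads already removes the exponential blow-up:
function fib_memo(n, cache = Dict{Int,Int}())
    n < 2 && return n
    haskey(cache, n) && return cache[n]
    cache[n] = fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
end
@time fib_memo(43)  # only ~43 cache entries are ever computed, so this is essentially instant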

Element-wise variance of an iterator

What's a numerically-stable way of taking the variance of an iterator elementwise? As an example, I would like to do something like
var((rand(4,2) for i in 1:10))
and get back a (4,2) matrix which is the variance in each coefficient. This throws an error using Julia's Base var. Is there a package that can handle this? Or an easy (and storage-efficient) way to do this using the Base Julia function? Or does one need to be developed on its own?
I went ahead and implemented a Welford algorithm to calculate this:
# Welford algorithm
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
function componentwise_meanvar(A; bessel=true)
    x0 = first(A)
    n = 0
    mean = zero(x0)
    M2 = zero(x0)
    delta = zero(x0)
    delta2 = zero(x0)
    for x in A
        n += 1
        delta .= x .- mean
        mean .+= delta./n
        delta2 .= x .- mean
        M2 .+= delta.*delta2
    end
    if n < 2
        return NaN
    else
        if bessel
            M2 .= M2 ./ (n .- 1)
        else
            M2 .= M2 ./ n
        end
        return mean, M2
    end
end
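For example (a hedged usage sketch, not part of the original answer), calling it on the generator from the question:
m, v = componentwise_meanvar(rand(4,2) for i in 1:10)  # m and v are both 4×2 matrices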
A few other algorithms are implemented in DiffEqMonteCarlo.jl as well. I'm surprised I couldn't find a library for this, but maybe will refactor this out someday.
See update below for a numerically stable version
Another method to calculate this:
srand(0) # reset random for comparing across implementations
moment2var(t) = (t[3]-t[2].^2./t[1])./(t[1]-1)
foldfunc(x,y) = (x[1]+1,x[2].+y,x[3].+y.^2)
moment2var(foldl(foldfunc,(0,zeros(1,1),zeros(1,1)),(rand(4,2) for i=1:10)))
Gives:
4×2 Array{Float64,2}:
0.0848123 0.0643537
0.0715945 0.0900416
0.111934 0.084314
0.0819135 0.0632765
Similar to:
srand(0) # reset random for comparing across implementations
# naive component-wise application of `var` function
map(var,zip((rand(4,2) for i=1:10)...))
which is the non-iterator version (or offline version in CS terminology).
This method is based on calculating the variance from the mean and the sum of squares. moment2var and foldfunc are just helper functions, but it fits in one line without them.
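(For reference, the identity being used elementwise here is Var(x) = (Σ xᵢ² − (Σ xᵢ)²/n) / (n − 1), computed from the count, the sum, and the sum of squares collected by foldfunc.)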
Comments:
Speedwise, this should be pretty good as well. Perhaps, StaticArrays and initializing the foldl's v0 with the correct eltype of the iterator would save even more time.
Benchmarking gave 5x speed advantage (and better memory usage) over componentwise_meanvar (from another answer) on a sample input.
Using moment2meanvar(t) = (t[2]./t[1], (t[3]-t[2].^2./t[1])./(t[1]-1)) gives both mean and variance like componentwise_meanvar.
As @ChrisRackauckas noted, this method suffers from numerical instability when the number of elements to sum is large.
--- UPDATE with variant of method ---
A little abstraction of the question asks for a way to do a foldl (and reduce,foldr) on an iterator returning a matrix, element-wise and retaining shape. To do so, we can define an assisting function mfold which takes a folding-function and makes it fold matrices element-wise. Define it as follows:
mfold(f) = (x,y)->[f(t[1],t[2]) for t in zip(x,y)]
For this specific problem of variance, we can define the component-wise fold functions, and a final function to combine the moments into the variance (and mean if wanted). The code:
ff(x,y) = (x[1]+1,x[2]+y,x[3]+y^2) # fold and collect moments
moment2var(t) = (t[3]-t[2]^2/t[1])/(t[1]-1) # calc variance from moments
moment2meanvar(t) = (t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1))
We can see moment2meanvar works on a single vector as follows:
julia> moment2meanvar(foldl(ff,(0.0,0.0,0.0),[1.0,2.0,3.0]))
(2.0, 1.0)
Now to matrix-ize it using mfold (using .-notation):
moment2var.(foldl(mfold(ff),fill((0,0,0),(4,2)),(rand(4,2) for i=1:10)))
@ChrisRackauckas noted this is not numerically stable, and another method (detailed in Wikipedia) is better. Using mfold this could be implemented as:
# better fold function compensating the sums for stability
ff2(x,y) = begin
    delta = y - x[2]
    mean = x[2] + delta/(x[1]+1)
    return (x[1]+1, mean, x[3] + delta*(y-mean))
end
# combine the collected information for the variance (and mean)
m2var(t) = t[3]/(t[1]-1)
m2meanvar(t) = (t[2], t[3]/(t[1]-1))
Again we have:
m2var.(foldl(mfold(ff2),fill((0,0.0,0.0),(4,2)),(rand(4,2) for i=1:10)))
Giving the same results (perhaps a little more accurately).
Or an easy (and storage-efficient) way to do this using the Base Julia function?
Out of curiosity, why is the standard solution of using var along the external dimension not good for you?
julia> var(cat(3,(rand(4,2) for i in 1:10)...),3)
4×2×1 Array{Float64,3}:
[:, :, 1] =
0.08847 0.104799
0.0946243 0.0879721
0.105404 0.0617594
0.0762611 0.091195
Obviously, I'm using cat here, which clearly is not very storage efficient, just so I can use the Base Julia function and your original generator syntax as per your question. But you could make this storage efficient as well, if you initialise your random values directly on a preallocated array of size (4,2,10), so that's not really an issue here.
Or did I misunderstand your question?
EDIT - benchmark in response to comments
function standard_var(Y, A)
    for i in 1 : length(A)
        Y[:,:,i], = next(A,i);
    end
    var(Y,3)
end
function testit()
    A = (rand(4,2) for i in 1:10000);
    Y = Array{Float64, 3}(4,2,length(A));
    @time componentwise_meanvar(A); # as defined in Chris's answer above
    @time standard_var(Y, A) # standard variance + using preallocation
    @time var(cat(3, A...), 3); # standard variance without preallocation
    return nothing
end
julia> testit()
0.004258 seconds (10.01 k allocations: 1.374 MiB)
0.006368 seconds (49.51 k allocations: 2.129 MiB)
5.954470 seconds (50.19 M allocations: 2.989 GiB, 71.32% gc time)

Parallelization of Piecewise Polynomial Evaluation

I am trying to evaluate points in a large piecewise polynomial, which is obtained from a cubic-spline. This takes a long time to do and I would like to speed it up.
As such, I would like to evaluate points on a piecewise polynomial with parallel processes, rather than sequentially.
Code:
z = zeros(1e6, 1) ; % preallocate some memory for speed
Y = rand(11220,161) ; %some data, rand for generating a working example
X = 0 : 0.0125 : 2 ; % vector of data sites
pp = spline(X, Y) ; % get the piecewise polynomial form of the cubic spline.
The resulting structure is large.
for t = 1 : 1e6 % big number
    hcurrent = ppval(pp,t); %evaluate the piecewise polynomial at t
    z(t) = sum(x(t:t+M-1).*hcurrent,1) ; % do some operation of the interpolated value. Most likely not relevant to this question.
end
Unfortunately, with matrix form and using:
hcurrent = flipud(ppval(pp, 1: 1e6 ))
requires too much memory to process, so cannot be done. Is there a way that I can batch process this code to speed it up?
For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code that takes advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, moves some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(@power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
    hcurrent = sum(bsxfun(@times,c(idx(t)+d,:),x(t,:)),2);
    z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optimized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.
Use the parfor command for parallel loops. Also, precompute the z vector as z(j) = x(j:j+M-1) and compute hcurrent inside the parfor for a speed-up.
The spline parameter estimation can be written in matrix form.
Once you write it in matrix form and solve it, you can use the model matrix to evaluate the spline on all data points using matrix multiplication, which is probably the most tuned operation in MATLAB.

Julia and Lapack: pstrf multithreaded but trtrs not

My program written in Julia does not yield the expected computational performance. Basically the program first computes the Cholesky decomposition of a large matrix A using cholfact!, so that A = L'L. Then it solves Lx = b for different b using the backslash operator.
This results in straight calls to Lapack. The function cholfact! is implemented by pstrf! and the backslash operator uses trtrs!. These are the correct Lapack functions to use. While the function pstrf! is executed in parallel, the function trtrs! is not. The profiler tells me that most of the runtime is spent on trtrs!. The lines of code in my program are
F = cholfact!(A, :L, pivot = true) # precomputation, executed once
and
x = F[:L]\b[F.piv] # inside a loop, b is computed from x every step
Why is there a difference between the two Lapack functions? How can I get parallel execution of trtrs!?
