Performance of Vectorized vs Devectorized vs Multithreaded Julia codes - multithreading

I have a large array of floating-point numbers. I am multiplying the array by a scalar. What is the best (fastest) way to do it in Julia?
using BenchmarkTools
global const dim1 = 2000
global const dim2 = 2000
global const α = 0.9
global const α_array = α.*eye(dim2, dim2)
function decay1(x)
    x .*= α
end
function decay2(x)
    for j in 1:dim2
        for i in 1:dim1
            @inbounds x[i, j] *= α
        end
    end
end
function decay3(x)
    Threads.@threads for j in 1:dim2
        for i in 1:dim1
            @inbounds x[i, j] *= α
        end
    end
end
function decay4(x)
    x *= α_array
end
function decay5(x)
    scale!(x, α)
end
x = ones(dim1, dim2)
print("\nVectorized:\n")
@btime decay1(x)
x = ones(dim1, dim2)
print("\nDevectorized:\n")
@btime decay2(x)
x = ones(dim1, dim2)
print("\nMultithreaded:\n")
@btime decay3(x)
x = ones(dim1, dim2)
print("\nMultithreaded array multiplication:\n")
@btime decay4(x)
x = ones(dim1, dim2)
print("\nScale:\n")
@btime decay5(x)
decay1 is the vectorized implementation, decay2 is devectorized and decay3 is multithreaded with 4 threads.
I am seeing the following timings.
Vectorized:
2.291 ms (4 allocations: 112 bytes)
Devectorized:
2.221 ms (0 allocations: 0 bytes)
Multithreaded:
1.963 ms (1 allocation: 32 bytes)
Multithreaded array multiplication:
87.418 ms (2 allocations: 30.52 MiB)
Scale:
2.042 ms (0 allocations: 0 bytes)
The amount of speedup is clearly too low. What am I doing wrong? How can I do it better?

Related

How to speed up the calculation of a lot of small covariance in NumPy?

Is it possible to speed up small covariance calculations in NumPy? The function "diff_cov_ridge" is called millions of times in my program.
"theta" is a scalar, and "tx", "ty", "img1", "ix1", "iy1", "x1", "y1", "img2", "ix2", "iy2", "x2", "y2" are length n vectors.
def cov(a, b):
    return np.cov(a, b)[0, 1]

def diff_cov_ridge(theta, tx, ty, img1, ix1, iy1, x1, y1, img2, ix2, iy2, x2, y2):
    ct = np.cos(theta)
    st = np.sin(theta)
    eq1 = cov(img1, ix2*x2)
    eq2 = cov(img1, ix2*y2)
    eq3 = cov(img1, iy2*x2)
    eq4 = cov(img1, iy2*y2)
    eq5 = cov(img2, ix1*x1)
    eq6 = cov(img2, ix1*y1)
    eq7 = cov(img2, iy1*x1)
    eq8 = cov(img2, iy1*y1)
    eq9 = cov(ix2, ix1*tx*x1)
    eq10 = cov(ix1, ix2*tx*x2)
    eq11 = cov(ix1*y1, ix2*tx)
    eq12 = cov(ix1, ix2*tx*y2)
    eq13 = cov(ix1*x1, ix2*x2)
    eq14 = cov(ix1*x1, ix2*y2)
    eq15 = cov(ix1*y1, ix2*x2)
    eq16 = cov(ix1*y1, ix2*y2)
    eq17 = cov(ix1, iy2*tx*x2)
    eq18 = cov(ix1, iy2*tx*y2)
    eq19 = cov(ix1*x1, iy2*ty)
    eq20 = cov(ix1*y1, iy2*ty)
    eq21 = cov(ix1*x1, iy2*x2)
    eq22 = cov(ix1*x1, iy2*y2)
    eq23 = cov(ix1*y1, iy2*x2)
    eq24 = cov(ix1*y1, iy2*y2)
    eq25 = cov(ix2, iy1*tx*x1)
    eq26 = cov(ix2, iy1*tx*y1)
    eq27 = cov(iy1, ix2*ty*x2)
    eq28 = cov(iy1, ix2*ty*y2)
    eq29 = cov(ix2*x2, iy1*x1)
    eq30 = cov(ix2*y2, iy1*x1)
    eq31 = cov(ix2*x2, iy1*y1)
    eq32 = cov(ix2*y2, iy1*y1)
    eq33 = cov(iy1*x1, iy2*ty)
    eq34 = cov(iy1, iy2*ty*x2)
    eq35 = cov(iy1*y1, iy2*ty)
    eq36 = cov(iy1, iy2*ty*y2)
    eq37 = cov(iy1*x1, iy2*x2)
    eq38 = cov(iy1*x1, iy2*y2)
    eq39 = cov(iy1*y1, iy2*x2)
    eq40 = cov(iy1*y1, iy2*y2)
The definition of np.cov(a, b)[0, 1] is simply
np.sum((a - np.mean(a)) * (b - np.mean(b))) / (a.size - 1)
You can therefore avoid the computation of the diagonal elements and the indexing into a 2x2 matrix, which should speed up your computation by a factor of somewhere between 1.5x and 3x. A slightly faster formulation is
np.dot(a - a.mean(), b - b.mean()) / (a.size - 1)
Here is an informal timing test on very small (a.size == 10) arrays that shows the differences:
%timeit np.cov(a, b)[0, 1]
39.3 µs ± 751 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.sum((a - np.mean(a)) * (b - np.mean(b))) / (a.size - 1)
23.7 µs ± 370 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.dot(a - a.mean(), b - b.mean()) / (a.size - 1)
18 µs ± 83.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I strongly suspect that using the above formulations, you can pre-compute some of the quantities you need to avoid calling cov so many times.
You can break up the computation of covariance in the same way you do with variance:
((a * b).sum() - a.sum() * b.sum() / a.size) / (a.size - 1)
This gives an additional factor of 2x+ speedup:
%timeit ((a * b).sum() - a.sum() * b.sum() / a.size) / (a.size - 1)
8.03 µs ± 41.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The additional advantage here is that you can pre-compute many of your sums. For example, img1 appears in 4 of your equations, but you only need to compute img1.sum() once for all of them.

How to safely round-and-clamp from float64 to int64?

This question is about python/numpy, but it may apply to other languages as well.
How can the following code be improved to safely clamp large float values to the
maximum int64 value during conversion? (Ideally, it should still be efficient.)
import numpy as np
def int64_from_clipped_float64(x, dtype=np.int64):
    x = np.round(x)
    x = np.clip(x, np.iinfo(dtype).min, np.iinfo(dtype).max)
    # The problem is that np.iinfo(dtype).max is imprecisely approximated as a
    # float64, and the approximation leads to overflow in the conversion.
    return x.astype(dtype)

for x in [-3.6, 0.4, 1.7, 1e18, 1e25]:
    x = np.array(x, dtype=np.float64)
    print(f'x = {x:<10} result = {int64_from_clipped_float64(x)}')
# x = -3.6 result = -4
# x = 0.4 result = 0
# x = 1.7 result = 2
# x = 1e+18 result = 1000000000000000000
# x = 1e+25 result = -9223372036854775808
The problem is that the largest np.int64 is 2^63 - 1, which is not representable in floating point. The same issue doesn't happen on the other end, because -2^63 is exactly representable.
So do the clipping half in float space (for detection) and in integer space (for correction):
def int64_from_clipped_float64(x, dtype=np.int64):
    assert x.dtype == np.float64
    limits = np.iinfo(dtype)
    too_small = x <= np.float64(limits.min)
    too_large = x >= np.float64(limits.max)
    ix = x.astype(dtype)
    ix[too_small] = limits.min
    ix[too_large] = limits.max
    return ix
Here is a generalization of the answer by @orlp to safely clip-convert from
arbitrary floats to arbitrary integers, and to support scalar values as input.
The function is also useful for the conversion of np.float32 to np.int32
because it avoids the creation of intermediate np.float64 values,
as seen in the timing measurements.
def int_from_float(x, dtype=np.int64):
    x = np.asarray(x)
    assert issubclass(x.dtype.type, np.floating)
    input_is_scalar = x.ndim == 0
    x = np.atleast_1d(x)
    imin, imax = np.iinfo(dtype).min, np.iinfo(dtype).max
    fmin, fmax = x.dtype.type((imin, imax))
    too_small = x <= fmin
    too_large = x >= fmax
    ix = x.astype(dtype)
    ix[too_small] = imin
    ix[too_large] = imax
    return ix.item() if input_is_scalar else ix
print(int_from_float(np.float32(3e9), dtype=np.int32)) # 2147483647
print(int_from_float(np.float32(5e9), dtype=np.uint32)) # 4294967295
print(int_from_float(np.float64(1e25), dtype=np.int64)) # 9223372036854775807
a = np.linspace(0, 5e9, 1_000_000, dtype=np.float32).reshape(1000, 1000)
%timeit int_from_float(np.round(a), dtype=np.int32)
# 100 loops, best of 3: 3.74 ms per loop
%timeit np.clip(np.round(a), np.iinfo(np.int32).min, np.iinfo(np.int32).max).astype(np.int32)
# 100 loops, best of 3: 5.56 ms per loop

Slower times in multithreading using Julia 1.3.1

I recently started using the new multithreading interface in version 1.3.1. After trying the Fibonacci example in this blog post and getting significant speedups, I started experimenting with some old algorithms of mine.
I have a function that uses the trapezoid method to calculate integrals, either below or above a curve:
function trapezoid( x :: AbstractVector ,
                    y :: AbstractVector ;
                    y0 :: Number = 0.0 ,
                    inv :: Number = NaN )
    int = zeros(length(x)-1)
    for i = 2:length(x)
        if isnan(inv) == true
            int[i-1] = (y[i]+y[i-1]-2y0) * (x[i]-x[i-1]) / 2
        else
            int[i-1] = (2inv-(y[i]+y[i-1])-2y0) * (x[i]-x[i-1]) / 2
        end # if
    end # for
    integral = sum(int) ;
    return integral
end
Then I have a very inefficient algorithm that determines the midpoint index of a curve by comparing the area below and above the curve:
function EAM_without_threads( x :: Vector{Float64} ,
                              y :: Vector{Float64} ,
                              y0 :: Real ,
                              ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = trapezoid( x1 , y1 , y0=y0 )
        Au = trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(Al-Au)
    end
    minind = findmin(approx)[2]
    return x[minind]
end
And:
function EAM_with_threads( x :: Vector{Float64} ,
                           y :: Vector{Float64} ,
                           y0 :: Real ,
                           ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = Threads.@spawn trapezoid( x1 , y1 , y0=y0 )
        Au = Threads.@spawn trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(fetch(Al)-fetch(Au))
    end
    minind = findmin(approx)[2]
    return x[minind]
end
This is what I used to try both functions:
using SpecialFunctions
using BenchmarkTools
x = collect(-10.0:5e-4:10.0)
y = erf.(x)
And then got these results:
julia> @btime EAM_without_threads(x,y,-1.0,1.0)
7.515 s (315905 allocations: 11.94 GiB)
julia> @btime EAM_with_threads(x,y,-1.0,1.0)
10.295 s (1274131 allocations: 12.00 GiB)
I don't understand... Using htop I can see that all my 8 threads are working almost at full capacity. This is my machine:
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-4712MQ CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Environment:
JULIA_NUM_THREADS = 8
I know about the overhead of dealing with several threads, and in small problems I understand if it's slower, but why in this case?
I'm also searching for multithreading "good practices", because I guess not every piece of code will benefit from parallelism.
Thank you all in advance.
Your code is doing some very redundant work here. It's doing a full trapezoidal integral for each step, instead of just updating Al and Au incrementally. Here I've rewritten the code so that it does zero allocations, and my version of the EAM is, on my computer, 5 orders of magnitude faster than the original, without using any threads.
In general: before you start looking into things like threading, consider whether your algorithm is efficient. You can get much bigger speedups from a fast algorithm than from threading.
function trapz(x, y; y0=0.0, inv=NaN)
    length(x) != length(y) && error("Input arrays cannot have different lengths")
    s = zero(eltype(x))
    if isnan(inv)
        @inbounds for i in eachindex(x, y)[1:end-1]
            s += (y[i+1] + y[i] - 2y0) * (x[i+1] - x[i])
        end
    else
        @inbounds for i in eachindex(x, y)[1:end-1]
            s += (2inv - (y[i+1] + y[i]) - 2y0) * (x[i+1] - x[i])
        end
    end
    return s / 2
end
function eam(x, y, y0, ymean)
    length(x) != length(y) && error("Input arrays cannot have different lengths")
    Au = trapz(x, y; inv=ymean)
    Al = zero(Au)
    amin = abs(Al - Au)
    ind = firstindex(x)
    @inbounds for i in eachindex(x, y)[2:end-1] # 2:length(x)-1
        Al += (y[i] + y[i-1] - 2y0) * (x[i] - x[i-1]) / 2
        Au -= (2ymean - (y[i] + y[i-1])) * (x[i] - x[i-1]) / 2
        aval = abs(Al - Au)
        if aval < amin
            (amin, ind) = (aval, i)
        end
    end
    return x[ind]
end
Benchmarks here (I use @time for your code and @btime for my own, since it would just be too time consuming to use @btime on really slow code):
julia> x = collect(-10.0:5e-4:10.0);
julia> y = erf.(x);
julia> @time EAM_without_threads(x, y, -1.0, 1.0)
15.611004 seconds (421.72 k allocations: 11.942 GiB, 11.73% gc time)
0.0
julia> @btime eam($x, $y, -1.0, 1.0)
181.991 μs (0 allocations: 0 bytes)
0.0
A small extra remark: you should not write if isnan(inv) == true, that is redundant. Just write if isnan(inv).
Try this
function EAM_with_threads( x :: Vector{Float64} ,
                           y :: Vector{Float64} ,
                           y0 :: Real ,
                           ymean :: Real )
    approx = Vector{Float64}(undef,length(x)-1)
    Threads.@threads for i in 1:length(x)-1
        x1 = @view(x[1:i ])
        x2 = @view(x[i:end])
        y1 = @view(y[1:i ])
        y2 = @view(y[i:end])
        Al = trapezoid( x1 , y1 , y0=y0 )
        Au = trapezoid( x2 , y2 , inv=ymean )
        approx[i] = abs(Al-Au)
    end
    minind = findmin(approx)[2]
    return x[minind]
end
Your for loop is easily parallelizable, so the lowest-hanging fruit is to do each iteration of the "for loop" in parallel. It is much easier to reduce the overall time taken by doing this in parallel than to try and parallelize the internal instance of a "for loop".
I know about the overhead of dealing with several threads, and in
small problems I understand if it's slower, but why in this case?
Well, I think your first problem is that you didn't benchmark how long it takes to do
trapezoid( x1 , y1 , y0=y0 )
If you did, you would find that it takes hardly any time at all. Anything that does not take up a substantial amount of time is not worth doing in parallel. If A and B are independent and both take a long time, then you should do A and B in parallel. Otherwise, find something else to parallelize first.
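As a rough illustration (a sketch, not part of the original answer; it assumes the question's trapezoid function is already defined), you could time one inner call and compare it with the cost of spawning and fetching a single task:
using BenchmarkTools, SpecialFunctions

x = collect(-10.0:5e-4:10.0)
y = erf.(x)
i = 20000                  # an arbitrary iteration from roughly the middle of the loop
x1 = @view(x[1:i])
y1 = @view(y[1:i])

@btime trapezoid($x1, $y1, y0=-1.0)   # work done by one inner call
@btime fetch(Threads.@spawn 1 + 1)    # rough overhead of spawning and fetching one task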
Let's look at what you have:
x = collect(-10.0:5e-4:10.0)
and
for i in 1:length(x)-1
So basically your for loop has around 40000 iterations.
Your multithreading method takes
total_time = setup_time * 40000 + ind_work_time/2 * 40000
Whereas parallelizing the for loop takes
total_time = setup_time * 1 + ind_work_time * 40000/8
For comparison, the non-multithreaded method takes
total_time = ind_work_time * 40000

Cart-Pole Python Performance Comparison

I am comparing a cart and pole simulation with Python 3.7 and Julia 1.2. In Python the simulation is written as a class object, as seen below, and in Julia it is just a function. I am getting a consistent 0.2 second time to solve using Julia, which is much slower than Python. I do not understand Julia well enough to understand why. My guess is it has something to do with compiling or the way the loop is set up.
import math
import random
from collections import namedtuple
RAD_PER_DEG = 0.0174533
DEG_PER_RAD = 57.2958
State = namedtuple('State', 'x x_dot theta theta_dot')
class CartPole:
    """ Model for the dynamics of an inverted pendulum
    """
    def __init__(self):
        self.gravity = 9.8
        self.masscart = 1.0
        self.masspole = 0.1
        self.length = 0.5  # actually half the pole's length
        self.force_mag = 10.0
        self.tau = 0.02  # seconds between state updates
        self.x = 0
        self.x_dot = 0
        self.theta = 0
        self.theta_dot = 0

    @property
    def state(self):
        return State(self.x, self.x_dot, self.theta, self.theta_dot)

    def reset(self, x=0, x_dot=0, theta=0, theta_dot=0):
        """ Reset the model of a cartpole system to its initial conditions
        " theta is in radians
        """
        self.x = x
        self.x_dot = x_dot
        self.theta = theta
        self.theta_dot = theta_dot

    def step(self, action):
        """ Move the state of the cartpole simulation forward one time unit
        """
        total_mass = self.masspole + self.masscart
        pole_masslength = self.masspole * self.length
        force = self.force_mag if action else -self.force_mag
        costheta = math.cos(self.theta)
        sintheta = math.sin(self.theta)
        temp = (force + pole_masslength * self.theta_dot ** 2 * sintheta) / total_mass
        # theta acceleration
        theta_dotdot = (
            (self.gravity * sintheta - costheta * temp)
            / (self.length *
               (4.0/3.0 - self.masspole * costheta * costheta /
                total_mass)))
        # x acceleration
        x_dotdot = temp - pole_masslength * theta_dotdot * costheta / total_mass
        self.x += self.tau * self.x_dot
        self.x_dot += self.tau * x_dotdot
        self.theta += self.tau * self.theta_dot
        self.theta_dot += self.tau * theta_dotdot
        return self.state
To run the simulation the following code was used
from cartpole import CartPole
import time
cp = CartPole()
start = time.time()
for i in range(100000):
    cp.step(True)
end = time.time()
print(end-start)
The Julia code is
function cartpole(state, action)
    """Cart and Pole simulation in discrete time
    Inputs: cartpole( state, action )
    state: 1X4 array [cart_position, cart_velocity, pole_angle, pole_velocity]
    action: Boolean True or False where true is a positive force and False is a negative force
    """
    gravity = 9.8
    masscart = 1.0
    masspole = 0.1
    l = 0.5  # actually half the pole's length
    force_mag = 10.0
    tau = 0.02  # seconds between state updates
    # x = 0
    # x_dot = 0
    # theta = 0
    # theta_dot = 0
    x = state[1]
    x_dot = state[2]
    theta = state[3]
    theta_dot = state[4]
    total_mass = masspole + masscart
    pole_massl = masspole * l
    if action == 0
        force = force_mag
    else
        force = -force_mag
    end
    costheta = cos(theta)
    sintheta = sin(theta)
    temp = (force + pole_massl * theta_dot^2 * sintheta) / total_mass
    # theta acceleration
    theta_dotdot = (gravity * sintheta - costheta * temp) / (l * (4.0/3.0 - masspole * costheta * costheta / total_mass))
    # x acceleration
    x_dotdot = temp - pole_massl * theta_dotdot * costheta / total_mass
    x += tau * x_dot
    x_dot += tau * x_dotdot
    theta += tau * theta_dot
    theta_dot += tau * theta_dotdot
    new_state = [x x_dot theta theta_dot]
    return new_state
end
The run code is
@time include("cartpole.jl")
function run_sim()
    """Runs the cartpole simulation
    No inputs or outputs
    Use with @time run_sim() for timing purposes.
    """
    state = [0 0 0 0]
    for i = 1:100000
        state = cartpole( state, 0)
        #print(state)
        #print("\n")
    end
end
@time run_sim()
Your Python version takes 0.21s on my laptop. Here are timing results for the original Julia version on the same system:
julia> @time run_sim()
0.222335 seconds (654.98 k allocations: 38.342 MiB)
julia> @time run_sim()
0.019425 seconds (100.00 k allocations: 10.681 MiB, 37.52% gc time)
julia> @time run_sim()
0.010103 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
0.012553 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
0.011470 seconds (100.00 k allocations: 10.681 MiB)
julia> @time run_sim()
0.025003 seconds (100.00 k allocations: 10.681 MiB, 52.82% gc time)
The first run includes JIT compilation and takes ~0.2s, whereas after that each run takes 10-20ms. That breaks down into ~10ms of actual compute time and ~10ms of garbage collection time triggered every four calls or so. That means that Julia is about 10-20x faster than Python, excluding JIT compilation time, which is not bad for a straight port.
Why not count JIT time when benchmarking? Because you don't actually care about how long it takes to run fast programs like benchmarks. You're timing small benchmark problems to extrapolate how long it will take to run larger problems where speed really matters. JIT compilation time is proportional to the amount of code you're compiling, not to problem size. So when solving larger problems that you actually care about, the JIT compilation will still only take 0.2s, which is a negligible fraction of execution time for large problems.
Now, let's see about making the Julia code even faster. This is actually very simple: use a tuple instead of a row vector for your state. So initialize the state as state = (0, 0, 0, 0) and then update the state similarly:
new_state = (x, x_dot, theta, theta_dot)
That's it, otherwise the code is identical. For this version the timings are:
julia> @time run_sim()
0.132459 seconds (479.53 k allocations: 24.020 MiB)
julia> @time run_sim()
0.008218 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
0.007230 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
0.005379 seconds (4 allocations: 160 bytes)
julia> @time run_sim()
0.008773 seconds (4 allocations: 160 bytes)
The first run still includes JIT time. Subsequent runs are now 5-10ms, which is about 25-40x faster than the Python version. Note that there are almost no allocations—small, fixed numbers of allocations are just for return values and won't trigger GC if this is called from other code in a loop.
Okay, so I've just run your Python and Julia code, and I get different results: 1.41 s for 10m iterations for Julia, 25.5 seconds for 10m iterations for Python. Already, Julia is 18x faster!
I think perhaps the issue is that @time is not accurate when run in global scope - you need multi-second timings for it to be accurate enough. You can use the package BenchmarkTools to get accurate timings of small functions.
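For example (a minimal sketch, assuming run_sim is defined as above):
using BenchmarkTools

# @btime runs the function many times and reports the minimum time,
# so JIT compilation and noise from the first call are excluded.
@btime run_sim()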
Standard performance tips apply: https://docs.julialang.org/en/v1/manual/performance-tips/index.html
In particular, use dots to avoid allocations and fuse loops. Also, for this kind of small-array computation, consider using https://github.com/JuliaArrays/StaticArrays.jl, which is much faster.
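For instance, here is a minimal sketch (an illustrative rewrite of the question's cartpole, not code from either answer) that keeps the state in a stack-allocated SVector from StaticArrays, so each step returns a fixed-size value and the loop allocates nothing on the heap:
using StaticArrays

# Sketch: same physics as the question's cartpole, but the state is an SVector.
function cartpole_sv(state::SVector{4,Float64}, action)
    gravity   = 9.8
    masscart  = 1.0
    masspole  = 0.1
    l         = 0.5    # actually half the pole's length
    force_mag = 10.0
    tau       = 0.02   # seconds between state updates
    x, x_dot, theta, theta_dot = state
    total_mass = masspole + masscart
    pole_massl = masspole * l
    force = action == 0 ? force_mag : -force_mag
    costheta = cos(theta)
    sintheta = sin(theta)
    temp = (force + pole_massl * theta_dot^2 * sintheta) / total_mass
    theta_dotdot = (gravity * sintheta - costheta * temp) /
                   (l * (4.0/3.0 - masspole * costheta * costheta / total_mass))
    x_dotdot = temp - pole_massl * theta_dotdot * costheta / total_mass
    # Returning an SVector keeps the whole state on the stack (no per-step allocation).
    return SVector(x + tau * x_dot,
                   x_dot + tau * x_dotdot,
                   theta + tau * theta_dot,
                   theta_dot + tau * theta_dotdot)
end

function run_sim_sv()
    state = SVector(0.0, 0.0, 0.0, 0.0)
    for i = 1:100000
        state = cartpole_sv(state, 0)
    end
    return state
end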

More efficient way to bruteforce finding solutions to (x+y)^2=str(x)+str(y)? Can it be vectorised?

So far I have written:
n=1000
solutions=[]
for i in range(1,n+1):
    for j in range(1,n+1):
        if str((i+j)**2)==str(i)+str(j):
            solutions.append("("+str(i)+"+"+str(j)+")^2 = "+str((i+j)**2))
for solution in solutions:
    print(solution)
This takes 1.03 seconds on my computer. Is there a quicker way to implement the comparison? I found a page on vectorisation but I'm not sure how I would generate the list I would need to then vectorise the comparison.
This can be done even faster by searching for an (x, y) pair that satisfies the equation for each square in your target range. In fact, this reduces the problem from O(n^2) to O(n log n) time complexity.
def split_root(n):
    div = 10
    while div < n:
        x, y = divmod(n, div)
        div *= 10
        if not y or y < div // 100: continue
        if (x + y) ** 2 == n: yield x, y
Then just iterate over the possible squares:
def squares(n):
    for i in range(n):
        for sr in split_root(i ** 2):
            yield "({}+{})^2 = {}".format(*sr, sum(sr)**2)
Example usage:
print("\n".join(squares(100000)))
Output:
(8+1)^2 = 81
(20+25)^2 = 2025
(30+25)^2 = 3025
(88+209)^2 = 88209
(494+209)^2 = 494209
(494+1729)^2 = 4941729
(744+1984)^2 = 7441984
(2450+2500)^2 = 24502500
(2550+2500)^2 = 25502500
(5288+1984)^2 = 52881984
(6048+1729)^2 = 60481729
(3008+14336)^2 = 300814336
(4938+17284)^2 = 493817284
(60494+17284)^2 = 6049417284
(68320+14336)^2 = 6832014336
For comparison, your original solution-
def op_solver(n):
    solutions = []
    for i in range(1,n+1):
        for j in range(1,n+1):
            if str((i+j)**2)==str(i)+str(j):
                solutions.append("("+str(i)+"+"+str(j)+")^2 = "+str((i+j)**2))
    return solutions
>>> timeit("op_solver(1000)", setup="from __main__ import op_solver", number=5) / 5
0.8715057126013562
My solution-
>>> timeit("list(squares(2000))", setup="from __main__ import squares", number=100) / 100
0.006898956680088304
Roughly a 125x speedup for your example usage range, and it will run asymptotically faster as n grows.
This also has the benefit of being faster and simpler than the numpy solution, without of course requiring numpy. If you do need a faster version, I'm sure you can even vectorize my code to get the best of both worlds.
You can make the calculation faster by avoiding string manipulation.
Instead of concatenating strings, use i * 10**(int(math.log10(j))+1) + j to "concatenate" numerically:
In [457]: i, j = 20, 25; i * 10**(int(math.log10(j))+1) + j
Out[457]: 2025
You can also use NumPy to vectorize the calculation:
import numpy as np
n = 1000
def using_numpy(n):
    i = range(1, n+1)
    j = range(1, n+1)
    I, J = np.meshgrid(i, j)
    left = (I+J)**2
    j_digits = np.log10(J).astype(int) + 1
    right = I*10**j_digits + J
    mask = left == right
    solutions = ['({i}+{j})^2 = {k}'.format(i=i, j=j, k=k)
                 for i, j, k in zip(I[mask], J[mask], left[mask])]
    return solutions

def using_str(n):
    solutions=[]
    for i in range(1,n+1):
        for j in range(1,n+1):
            if str((i+j)**2)==str(i)+str(j):
                solutions.append("("+str(i)+"+"+str(j)+")^2 = "+str((i+j)**2))
    return solutions
print('\n'.join(using_numpy(n)))
# print('\n'.join(using_str(n)))
yields
(8+1)^2 = 81
(20+25)^2 = 2025
(30+25)^2 = 3025
(88+209)^2 = 88209
(494+209)^2 = 494209
For n = 1000, using_numpy is about 16x faster than using_str:
In [455]: %timeit using_str(n)
500 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [470]: %timeit using_numpy(n)
31.1 ms ± 98.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
