In the code below, the inner loop creates a vector of all 1's, and then sets each value of the vector to 0. The sum of this vector is expected to then be 0. But, if I use Threads.@threads on the inner loop, sometimes it isn't. Am I violating a rule about multithreading with this code? It seems that the inner loop does not always "finish". I have JULIA_NUM_THREADS set to 4.
N = 1000
M = 1000
for i in 1:N
    test = trues(M)
    Threads.@threads for j in 1:M
        test[j] = 0
    end
    s = sum(test)
    if s > 0
        println("sum(M) was ", s, "!")
    end
end
println("done!")
Sample output:
There do not seem to be any problems with using Threads.@threads on the outer loop. How can I predict where it is OK to use Threads.@threads and where it is not?
This is because trues creates a BitArray, in which boolean values are efficiently packed as individual bits within 64-bit chunks. Writing a single element is therefore a read-modify-write of the whole chunk, so concurrent accesses to adjacent indices can be racy.
This is discussed in the following issue, and should be documented as soon as this PR is merged.
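To make the failure mode concrete, here is a small sketch (not from the linked issue; clear_bits_locked! is just an illustrative name) that keeps the BitVector but guards each write with a lock. Setting test[j] = 0 on a BitVector rewrites the 64-bit chunk containing bit j, so two threads writing different bits of the same chunk can silently undo each other's updates; the lock serializes those chunk updates, at the cost of losing most of the parallel speedup.

function clear_bits_locked!(test::BitVector)
    lk = ReentrantLock()                 # one lock guarding all chunk updates
    Threads.@threads for j in eachindex(test)
        lock(lk) do                      # serialize the chunk read-modify-write
            test[j] = false
        end
    end
    return test
end

test = trues(1000)
clear_bits_locked!(test)
sum(test)    # reliably 0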
The problem disappears if you use another array type, such as Vector{Bool}:
N = 1000
M = 1000
for i in 1:N
    test = ones(Bool, M)  # creates a Vector{Bool}
    Threads.@threads for j in 1:M
        test[j] = 0
    end
    s = sum(test)
    if s > 0
        println("sum(M) was ", s, "!")
    end
end
println("done!")
Aside from the race condition, you should also be aware that iterating over and modifying individual elements of a BitArray is much slower than doing the same with an Array{Bool}. BitArrays are memory-efficient and can be extremely fast for chunked operations, but they are not good for individual element access:
using BenchmarkTools, Random

function itertest!(a)
    for i in eachindex(a)
        @inbounds a[i] = !a[i]
    end
end

julia> @btime itertest!(a) setup=(a=rand(Bool, 1000));
  70.050 ns (0 allocations: 0 bytes)

julia> @btime itertest!(a) setup=(a=bitrand(1000));
  2.211 μs (0 allocations: 0 bytes)
On the other hand, for chunked operations, performance is stellar:
function dottest!(a)
    a .= .!a
end

julia> @btime dottest!(a) setup=(a=rand(Bool, 1000));
  71.795 ns (0 allocations: 0 bytes)

julia> @btime dottest!(a) setup=(a=bitrand(1000));
  6.904 ns (0 allocations: 0 bytes)
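A side note tying this back to the original question: clearing every element is itself a chunked operation, so on a BitVector a plain single-threaded fill! (or a broadcast) is both race-free and very fast. Assuming the same kind of vector as above:

test = trues(1000)
fill!(test, false)    # operates on whole 64-bit chunks, no threads needed
# or equivalently: test .= false
sum(test)             # 0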
I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads using Julia v1.5.3 and the speedup doesn't seem to scale up very well when compared to running the same loop with, for instance, Chapel.
I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.
Sequential code
using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 8.034s
Shared memory parallelism with Threads.@threads on 4 threads
using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 6.938s
Speedup: 1.2
Shared memory parallelism with FLoops on 4 threads
using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 10.850s
No speedup: slower than the sequential code.
Tests on various numbers of threads (different hardware)
I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads.
Here are the results:
Number of threads    Speedup
2                    1.2
4                    1.2
8                    1.0 (no speedup)
16                   0.9 (the code takes longer to run than the sequential code)
For heavier computations (n = 1e9 in the code above) which would minimize the relative effect of any overhead, the results are very similar:
Number of threads    Speedup
2                    1.1
4                    1.3
8                    1.1
16                   0.8 (the code takes longer to run than the sequential code)
For comparison: same loop with Chapel showing perfect scaling
Code run with Chapel v1.23.0:
use Time;

var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;

watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();

writef("total = %{###.###############} in %{##.##} seconds\n",
       total, watch.elapsed());
First run (same hardware as the first Julia tests):
Number of threads    Time (s)    Speedup
1                    13.33       n/a
2                    7.34        1.8
Second run (same hardware):
Number of threads    Time (s)    Speedup
1                    13.59       n/a
2                    6.83        2.0
Third run (different hardware):
Number of threads    Time (s)    Speedup
1                    19.99       n/a
2                    10.06       2.0
4                    5.05        4.0
8                    2.54        7.9
16                   1.28        15.6
Someone can make a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is far too light. Using an atomic, in this case, implies huge overhead because all threads spend most of their time waiting on it.
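To illustrate the point, here is a minimal sketch (not part of the original answer; the name slow_chunked and the ntasks keyword are purely illustrative) that avoids the atomic entirely: split the range into one chunk per task, accumulate a thread-local partial sum in each, and reduce the partial sums at the end.

function slow_chunked(n::Int, digits::String; ntasks = Threads.nthreads())
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            partial = 0.0                    # task-local accumulator, no lock needed
            for i in chunk
                if !occursin(digits, string(i))
                    partial += 1.0 / i
                end
            end
            partial
        end
    end
    return sum(fetch.(tasks))                # combine the per-task partial sums
end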
Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:
julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+, 1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"
This is with 4 threads. I've tested with other numbers of threads and confirmed that the scaling is pretty much linear.
Btw, as a general tip, try to avoid printing inside benchmarked code: it makes a mess when the code is timed repeatedly, and if your task is fast, the I/O can take a non-negligible share of the time.
As jling suggests in the comments on their answer, the problem here is most likely that the Julia code allocates lots of memory that needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, which could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of String:
using BenchmarkTools
using Transducers
using Distributed

function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end

println(@btime slow_no_garbage(UInt(1e8), "9"))
I do not recommend using this code (especially since, because the numbers always grow in length, I don't properly clear the thread buffer between iterations, although that is easily fixed). However, it results in almost linear scaling with the number of threads (see the table at the end of the answer).
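For completeness, a hedged sketch of that "easy fix" (string_vector_clean! is an illustrative name): after writing the digits of x, zero out the rest of the reused buffer so stale digits from a previously formatted, longer number can never produce a false match.

function string_vector_clean!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    string_vector!(a, x)                 # write the digits of x into a[1:n]
    @inbounds for k in n+1:length(a)
        a[k] = 0x00                      # clear any stale bytes beyond the current number
    end
    return a
end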
As jling also mentioned, if a lot of garbage is created, distribution may work better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:
using BenchmarkTools
using Transducers

function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end

println(@btime slow_transducers(Int64(1e8), "9"))
and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):
using BenchmarkTools
using Transducers
using Distributed

function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxd(+, Map(fun), 1:n)
    "total = $total"
end

addprocs(parse(Int, ARGS[1]))
println(@btime slow_distributed(Int64(1e8), "9"))
The following table shows the results of running all versions with different number of threads/processes:
n    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1    4.242 s    4.224 s             4.241 s             2.743 s            7.32 s
2    2.952 s    2.958 s             2.168 s             1.447 s            3.73 s
4    2.135 s    2.147 s             1.163 s             0.716105 s         1.9 s
8    1.742 s    1.741 s             0.859058 s          0.360469 s
Speedup:
n    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1    1.0        1.0                 1.0                 1.0                1.0
2    1.43699    1.42799             1.95618             1.89565            1.96247
4    1.98689    1.9674              3.6466              3.83044            3.85263
8    2.43513    2.42619             4.9368              7.60953
As pointed out by the previous answer, I also found that the performance of multi-threading in Julia is strongly influenced by garbage collection.
I used a simple trick: add GC.gc() before the multi-threaded task to "clean up" the previous garbage. Note that this only helps when the memory allocation is not too large.
BTW, you can use GC.enable_logging(true) to get an idea of how long GC takes (it is huge in my code!).
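A minimal sketch of how these two tips fit together (GC.enable_logging is available from Julia 1.8 onwards):

GC.gc()                    # collect leftover garbage before the timed region
GC.enable_logging(true)    # print a line for every collection that happens
# ... run the multithreaded task here ...
GC.enable_logging(false)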
In Python, it is usually suggested to vectorize the code to make computation faster. For example, if you want to compute the inner product of two vectors, say a and b, usually
Code A
c = np.dot(a, b)
is better than
Code B
c = 0
for i in range(len(a)):
    c += a[i] * b[i]
But in Julia, it seems vectorization is sometimes not that helpful. I regarded '* and dot as vectorized versions and an explicit for loop as the non-vectorized version, and got the following result.
using Random
using LinearAlgebra

len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)

function inner_product(a, b)
    result = 0
    for i in 1:length(a)
        result += a[i] * b[i]
    end
    return result
end

@time a' * b
@time dot(a, b)
@time inner_product(a, b);

  0.013442 seconds (56.76 k allocations: 3.167 MiB)
  0.003484 seconds (106 allocations: 6.688 KiB)
  0.008489 seconds (17.52 k allocations: 976.752 KiB)
(I know using BenchmarkTools.jl is a more standard way to measure the performance.)
From the output, dot runs faster than the explicit for loop, which in turn runs faster than '*, which contradicts what I had presumed.
So my question is,
does Julia need (or sometimes need) vectorization to speed up computation?
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
You are not running the benchmarks correctly, and the implementation of your function is suboptimal.
julia> using BenchmarkTools

julia> @btime $a' * $b
  429.400 μs (0 allocations: 0 bytes)
249985.3680190253

julia> @btime dot($a, $b)
  426.299 μs (0 allocations: 0 bytes)
249985.3680190253

julia> @btime inner_product($a, $b)
  970.500 μs (0 allocations: 0 bytes)
249985.36801903677
The correct implementation:
function inner_product_correct(a, b)
    result = 0.0  # use the same type as the elements of the arguments
    @simd for i in 1:length(a)
        @inbounds result += a[i] * b[i]
    end
    return result
end

julia> @btime inner_product_correct($a, $b)
  530.499 μs (0 allocations: 0 bytes)
249985.36801902478
There is still a difference (although a much less significant one) because dot uses the optimized BLAS implementation, which is multi-threaded. You could (following Bogumil's comment) set OPENBLAS_NUM_THREADS=1, and then you will find that the timing of the BLAS function is essentially identical to that of the Julia implementation.
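If you prefer to change this from inside a running session rather than via the environment variable, a small sketch (timings will of course differ per machine):

using LinearAlgebra, BenchmarkTools

BLAS.set_num_threads(1)   # same effect as starting Julia with OPENBLAS_NUM_THREADS=1
@btime dot($a, $b)        # should now land close to the inner_product_correct timing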
Note also that working with float numbers is tricky in many ways:
julia> inner_product_correct(a, b)==dot(a,b)
false
julia> inner_product_correct(a, b) ≈ dot(a,b)
true
Finally, in Julia, deciding whether to use vectorization or to write the loop yourself is up to you - there is no performance penalty (as long as you write type-stable code and use @simd and @inbounds where required). However, in your code you were not testing vectorization; you were comparing calling BLAS against writing the loop yourself. Here is the must-read to understand what is going on: https://docs.julialang.org/en/v1/manual/performance-tips/
Let me add my practical experience as a comment (too long for a standard comment):
does Julia need (or sometimes need) vectorization to speed up computation?
Julia does not need vectorization the way Python does (see the answer by Przemysław), but in practice, if a well-written vectorized function (like dot) already exists, use it: while possible, it can be tricky to write an equally performant function yourself (people have probably spent days optimizing dot, especially to make optimal use of multiple threads).
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
When you use vectorized code, it all depends on the implementation of the function you want to use. In this case dot(a, b) and a' * b are exactly the same, as running @edit a' * b gives you in this case:
*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)
and you see it is the same.
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
Julia is a compiled language, while Python is an interpreted language. In some cases the Python interpreter can provide fast execution speed, but in other cases it currently cannot (which does not mean it will not improve in the future). In particular, vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases, as they both just call this compiled function. However, when you use loops (non-vectorized code), Python will currently be slower than Julia.
I have the following simple code:
function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    minsofar = fill(n, Threads.nthreads())
    # minsofar = n
    Threads.@threads for i in 1:N
    # for i in 1:N
        for j in i+1:N
            dist = hamming4(strings[i], strings[j])
            if dist < minsofar[Threads.threadid()]
                minsofar[Threads.threadid()] = dist
            end
        end
    end
    return minimum(minsofar)
    # return minsofar
end

function ave_min(n, N)
    ITER = 10
    strings = random_strings2(n, N)
    new_min = find_min(strings, n, N)
    avesofar = new_min
    # print("New min ", new_min, ". New ave ", avesofar, "\n")
    total = avesofar
    for i in 1:ITER-1
        strings = random_strings2(n, N)
        new_min = find_min(strings, n, N)
        avesofar = avesofar*(i/(i+1)) + new_min/(i+1)
        print("Iteration ", i, ". New min ", new_min, ". New ave ", round(avesofar; digits=2), "\n")
    end
    return avesofar
end

N = 2^16
n = 99
print("Overall average ", ave_min(n, N), "\n")
When I run it on an AMD 8350 under Linux, the CPU usage is around 430% (instead of close to 800%).
Is it possible to make the parallelisation work more efficiently?
Also, I noticed a very impressive-looking new package called LoopVectorization.jl. As I am computing the Hamming distance in what looks like a vectorizable way, can the code also be vectorized and sped up using LoopVectorization.jl?
(I am completely new to Julia.)
The parallelization of your code seems to be correct.
Most likely you are running it in Atom or another IDE. By default, Atom is using only half of the cores (more exactly, only the physical, not the logical, cores).
E.g. running in Atom on my machine:
julia> Threads.nthreads()
4
What you need to do is to explicitly set JULIA_NUM_THREADS
Windows command line (still assuming 8 logical cores)
set JULIA_NUM_THREADS=8
Linux command line
export JULIA_NUM_THREADS=8
After doing that your code takes 100% on all my cores.
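Alternatively, on Julia 1.5 and newer you can pass the thread count directly on the command line instead of setting the environment variable, and verify it from within the session:

# start Julia with: julia -t 8    (or: julia --threads 8)
julia> Threads.nthreads()
8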
EDIT
After discussion - you can get the time down to around 20% of the single-threaded time by using Distributed instead of Threads, since this avoids memory sharing:
The code will look more or less like this:
using Distributed
addprocs(8)

@everywhere function hamming4(bits1::Integer, bits2::Integer)
    return count_ones(bits1 ⊻ bits2)
end

function random_strings2(n, N)
    mask = UInt128(1) << n - 1
    return [rand(UInt128) & mask for i in 1:N]
end

function find_min(strings, n, N)
    return @distributed (min) for i in 1:N-1
        minimum(hamming4(strings[i], strings[j]) for j in i+1:N)
    end
end

### ... the rest of code remains unchanged
I'm trying out the multithreading functionality of Julia 1.3 with the following hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
    if n < 2
        return n
    else
        return F(n-1) + F(n-2)
    end
end

@time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However, when running the following code, copied from the Julia page about multithreading,
import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end

fib(43)
what happens is that the utilisation of RAM/CPU jumps from 3.2GB/6% to 15GB/25% without any output (for at least 1 minute, after which I decided to kill the Julia session).
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single-threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that it allows spawning many, many tasks in different functions and letting the scheduler figure out an optimal workload.
The problem is that @spawn has a non-trivial overhead of around 1 µs, so if you spawn a thread to do a task that takes less than 1 µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something of order 1.6180^43 threads. If each one takes 1 µs to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and to re-merge / sync the threads, which takes even more time.
Things like this, where you spawn a thread for each step of a computation, only make sense if each step takes a long time compared to the @spawn overhead.
Note that there is work going into lessening the overhead of @spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get
function F(n)
    if n < 2
        return n
    else
        return F(n-1) + F(n-2)
    end
end

julia> @btime F(23);
  122.920 μs (0 allocations: 0 bytes)
so that's a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:
function fib(n::Int)
    if n < 2
        return n
    elseif n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return fib(n-1) + fib(n-2)
    end
end
Now, if I follow proper benchmarking methodology with BenchmarkTools.jl [2], I find
julia> using BenchmarkTools

julia> @btime fib(43)
  971.842 ms (1496518 allocations: 33.64 MiB)
433494437

julia> @btime F(43)
  1.866 s (0 allocations: 0 bytes)
433494437
@Anush asks in the comments: this is a factor-of-2 speedup using 16 cores, it seems. Is it possible to get something closer to a factor-of-16 speedup?
Yes it is. The problem with the above function is that its body is larger than that of F, with lots of conditionals, function / thread spawning and all that. I invite you to compare @code_llvm F(10) with @code_llvm fib(10). This means that fib is much harder for Julia to optimize. This extra overhead makes a world of difference for the small-n cases.
julia> @btime F(20);
  28.844 μs (0 allocations: 0 bytes)

julia> @btime fib(20);
  242.208 μs (20 allocations: 320 bytes)
Oh no! All that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n < 23, don't recurse down to fib; instead, call the single-threaded F.
function fib(n::Int)
    if n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return F(n)
    end
end

julia> @btime fib(43)
  138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The @btime macro from BenchmarkTools.jl runs the function multiple times, skipping the compilation time, and averages the results.
@Anush, as an example of using memoization and multithreading manually:
import Base.Threads.@spawn

_fib(::Val{1}, _, _) = 1
_fib(::Val{2}, _, _) = 1

_fib(x::Val{n}, d = zeros(Int, n), channel = Channel{Bool}(1)) where n = begin
    # lock the channel
    put!(channel, true)
    if d[n] != 0
        res = d[n]
        take!(channel)
    else
        take!(channel) # unlock channel so I can compute stuff
        # t = @spawn _fib(Val(n-2), d, channel)
        t1 = _fib(Val(n-2), d, channel)
        t2 = _fib(Val(n-1), d, channel)
        res = fetch(t1) + fetch(t2)
        put!(channel, true) # lock channel
        d[n] = res
        take!(channel) # unlock channel
    end
    return res
end

fib(n) = _fib(Val(n), zeros(Int, n), Channel{Bool}(1))

fib(1)
fib(2)
fib(3)
fib(4)

@time fib(43)

using BenchmarkTools
@benchmark fib(43)
But the speedup came from memoization and not so much from multithreading. The lesson here is that we should think about better algorithms before reaching for multithreading.
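To make that point concrete, here is a hedged sketch (fib_memo is just an illustrative name, not from the answer above): plain single-threaded memoization already makes fib(43) essentially instantaneous, with no threads, channels, or locks involved.

function fib_memo(n::Int, cache = Dict{Int,Int}())
    n < 2 && return n
    get!(cache, n) do                  # compute and store only on a cache miss
        fib_memo(n - 1, cache) + fib_memo(n - 2, cache)
    end
end

fib_memo(43)    # 433494437, effectively instantaneous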
I'm trying to implement the following formula in Julia for calculating the Gini coefficient of a wage distribution:

G = 1 - ( Σ_{i=1}^{n} f(y_i) (S_i + S_{i-1}) ) / S_n

where

S_i = Σ_{j=1}^{i} f(y_j) y_j   (with S_0 = 0)
Here's a simplified version of the code I'm using for this:
# Takes an array where the first column is the value of wages
# (y_i in the formula), and the second column is the probability
# of that wage value (f(y_i) in the formula).
function gini(wagedistarray)
    # First calculate the S values in the formula
    for i in 1:length(wagedistarray[:,1])
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    # Now calculate the value to subtract from 1 in the gini formula
    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:length(Swages)
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    # Final step of the gini calculation
    return giniwages = 1 - (Gwages/Swages[length(Swages)])
end

wagedistarray = zeros(10000,2)
Swages = zeros(length(wagedistarray[:,1]))

for i in 1:length(wagedistarray[:,1])
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end

@time result = gini(wagedistarray)
It gives a value of near zero, which is what you expect for a completely equal wage distribution. However, it takes quite a long time: 6.796 secs.
Any ideas for improvement?
Try this:
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages = Swages[1]*wagedistarray[1,2]
    for i in 2:nrows
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1 - (Gwages/Swages[length(Swages)])
end

wagedistarray = zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end

@time result = gini(wagedistarray)
Time before: 5.913907256 seconds (4000481676 bytes allocated, 25.37% gc time)
Time after: 0.134799301 seconds (507260 bytes allocated)
Time after (second run): elapsed time: 0.123665107 seconds (80112 bytes allocated)
The primary problem is that Swages was a global variable (it wasn't living inside the function), which is not good coding practice and, more importantly, is a performance killer. The other thing I noticed was length(wagedistarray[:,1]), which makes a copy of that column just to ask its length - that was generating some extra "garbage". The second run is faster because there is some compilation time the very first time the function is run.
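A quick illustration of that second point (numbers will vary):

A = zeros(10_000, 2)
@time length(A[:, 1])   # allocates a copy of the whole column just to count its elements
@time size(A, 1)        # no allocation, same information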
You can crank performance even higher using @inbounds, i.e.
function gini(wagedistarray)
    nrows = size(wagedistarray,1)
    Swages = zeros(nrows)
    @inbounds for i in 1:nrows
        for j in 1:i
            Swages[i] += wagedistarray[j,2]*wagedistarray[j,1]
        end
    end

    Gwages = Swages[1]*wagedistarray[1,2]
    @inbounds for i in 2:nrows
        Gwages += wagedistarray[i,2]*(Swages[i]+Swages[i-1])
    end

    return 1 - (Gwages/Swages[length(Swages)])
end
which gives me elapsed time: 0.042070662 seconds (80112 bytes allocated)
Finally, check out this version, which is actually faster than all and is also the most accurate I think:
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
             sum(wagedistarray[2:end,2] .*
                 (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end
This has elapsed time: 0.00041119 seconds (721664 bytes allocated). The main benefit was changing from an O(n^2) double for loop to the O(n) cumsum.
IainDunning has already provided a good answer with code that is fast enough for practical purposes (the gini2 function). If one enjoys performance tweaking, one can get an additional speedup of a factor of about 20 by avoiding temporary arrays (gini3). See the following code, which compares the performance of the two implementations:
using TimeIt

wagedistarray = zeros(10000,2)
for i in 1:size(wagedistarray,1)
    wagedistarray[i,1] = 1
    wagedistarray[i,2] = 1/10000
end
wages = wagedistarray[:,1]
wagefrequencies = wagedistarray[:,2];

# original code
function gini2(wagedistarray)
    Swages = cumsum(wagedistarray[:,1].*wagedistarray[:,2])
    Gwages = Swages[1]*wagedistarray[1,2] +
             sum(wagedistarray[2:end,2] .*
                 (Swages[2:end]+Swages[1:end-1]))
    return 1 - Gwages/Swages[end]
end

# new code
function gini3(wages, wagefrequencies)
    Swages_previous = wages[1]*wagefrequencies[1]
    Gwages = Swages_previous*wagefrequencies[1]
    @inbounds for i = 2:length(wages)
        freq = wagefrequencies[i]
        Swages_current = Swages_previous + wages[i]*freq
        Gwages += freq * (Swages_current+Swages_previous)
        Swages_previous = Swages_current
    end
    return 1.0 - Gwages/Swages_previous
end

result = gini2(wagedistarray) # warming up JIT
println("result with gini2: $result, time:")
@timeit result = gini2(wagedistarray)

result = gini3(wages, wagefrequencies) # warming up JIT
println("result with gini3: $result, time:")
@timeit result = gini3(wages, wagefrequencies)
The output is:
result with gini2: 0.0, time:
1000 loops, best of 3: 321.57 µs per loop
result with gini3: -1.4210854715202004e-14, time:
10000 loops, best of 3: 16.24 µs per loop
gini3 is somewhat less accurate than gini2 due to its sequential summation; one would have to use a variant of pairwise summation to increase the accuracy.
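As an illustration of that accuracy remark, here is a hedged sketch of gini3 with compensated (Kahan) summation for the Gwages accumulator - a simple alternative to pairwise summation that also curbs the round-off of the sequential sum (gini3_kahan is just an illustrative name):

function gini3_kahan(wages, wagefrequencies)
    Swages_previous = wages[1]*wagefrequencies[1]
    Gwages = Swages_previous*wagefrequencies[1]
    c = 0.0                                   # running compensation for lost low-order bits
    @inbounds for i = 2:length(wages)
        freq = wagefrequencies[i]
        Swages_current = Swages_previous + wages[i]*freq
        y = freq * (Swages_current + Swages_previous) - c
        t = Gwages + y
        c = (t - Gwages) - y                  # recover the part of y that was rounded away
        Gwages = t
        Swages_previous = Swages_current
    end
    return 1.0 - Gwages/Swages_previous
end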