I am trying to run the for loop below using @threads for multi-threading, but I always get an error about a mismatched number of outputs.
using Combinatorics, Base.Threads
import Base.Threads.@threads

# n and func(x, y) are assumed to be defined elsewhere
result = Float64[]
a = Float64[]
@threads for c in combinations(1:n, 2)
    x, y = c
    r = func(x, y)
    push!(result, r)  # every thread pushes to the same shared vectors
    push!(a, x)
end
The error observed during processing:
DimensionMismatch("column :result has length 60 and column :a has length 50")
Please suggest an approach which makes sure that no mutations from the loop are missed.
One thing that does not work in your example is that all threads are mutating (via push!) the same vectors (result and a) possibly concurrently, thereby allowing race conditions to happen.
One way around this would be to have a collection of vectors (one per thread); each thread only modifies its own vector (identified by its threadid()).
With such a technique, a simplified version of your example could look like this:
# The function we want to apply to each element
julia> f(x) = 2x+1
f (generic function with 1 method)
# Two collections of vectors (one vector for each thread)
# that will hold the results for each thread
julia> results = [Float64[] for _ in 1:Threads.nthreads()];
julia> as = [Float64[] for _ in 1:Threads.nthreads()]
8-element Vector{Vector{Float64}}:
[]
[]
[]
[]
[]
[]
[]
[]
julia> Threads.@threads for a in 1:10
           result = f(a)
           # Each thread only ever mutates its own result vector:
           # results[Threads.threadid()]
           push!(results[Threads.threadid()], result)
           push!(as[Threads.threadid()], a)
       end
Note that you'll get a collection of results, indexed by the id of the thread that produced them:
julia> results
8-element Vector{Vector{Float64}}:
[3.0, 5.0] # These results have been produced by thread #1
[7.0, 9.0]
[11.0]
[13.0]
[15.0]
[17.0]
[19.0]
[21.0]
julia> as
8-element Vector{Vector{Float64}}:
[1.0, 2.0]
[3.0, 4.0]
[5.0]
[6.0]
[7.0]
[8.0]
[9.0]
[10.0]
In the end, you therefore need to concatenate or flatten all resulting vectors in some way in order to combine all thread-specific results into one. One way would be to concatenate all results (which will allocate a new, large vector to hold all results):
julia> reduce(vcat, results)
10-element Vector{Float64}:
3.0
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
julia> reduce(vcat, as)
10-element Vector{Float64}:
1.0
2.0
3.0
4.0
5.0
6.0
7.0
8.0
9.0
10.0
Another way would be to directly iterate on the nested results, flattening them on the fly (so as not to allocate double the memory to store them in a flat fashion):
julia> using Base.Iterators: flatten
julia> for r in flatten(results)
println(r)
end
3.0
5.0
7.0
9.0
11.0
13.0
15.0
17.0
19.0
21.0
julia> for (a, r) in zip(flatten(as), flatten(results))
println("$a -> $r")
end
1.0 -> 3.0
2.0 -> 5.0
3.0 -> 7.0
4.0 -> 9.0
5.0 -> 11.0
6.0 -> 13.0
7.0 -> 15.0
8.0 -> 17.0
9.0 -> 19.0
10.0 -> 21.0
I recently had the following question regarding the mean! function of the Statistics.jl package.
A bug was reported about the behavior of mean!: as indicated there, mean! does not properly account for the possibility that its arguments alias each other, and in some such cases the result it returns is not correct:
julia> let a = [1 2 3]
           mean!(a, a)
       end
1×3 Array{Int64,2}:
 0  0  0

julia> let a = [1 2 3]
           mean!(copy(a), a)
       end
1×3 Array{Int64,2}:
 1  2  3
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
However, I think that this behavior is normal: looking at the definition of mean!, the result of the operation mean!(r, v) is written into r. Therefore it seems logical to me that if you use the same object as r and as v, the result is unpredictable.
I have seen that this also happens with the sum! function.
Can someone tell me whether I am right, or whether there is indeed something I am not understanding?
mean! behaves the way you observe because it internally calls sum!.
Now sum! behaves this way for the following reason. It was designed to perform summation without making any allocations. Therefore the first thing sum! does is initialize the target vector to 0 (the neutral element of summation). After this is done your a vector contains only 0s, and thus you get all 0s in the result as well.
However, it would indeed make sense for the docstrings of sum! (and similar functions) to mention that the target should not alias the source. Here is another example of the behavior you observed:
julia> x = [1 2 3
4 5 6
7 8 9]
3×3 Matrix{Int64}:
1 2 3
4 5 6
7 8 9
julia> y = view(x, :, 1)
3-element view(::Matrix{Int64}, :, 1) with eltype Int64:
1
4
7
julia> sum!(y, x)
3-element view(::Matrix{Int64}, :, 1) with eltype Int64:
5
11
17
julia> x
3×3 Matrix{Int64}:
5 2 3
11 5 6
17 8 9
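Given the above, the workarounds are straightforward. Here is a small sketch (not from the original answer) of the two safe patterns: reduce into a freshly allocated destination, or copy the source when the destination aliases it.

x = [1 2 3; 4 5 6; 7 8 9]

# Safe: the destination is a fresh vector that does not alias x
y = zeros(Int, 3)
sum!(y, x)          # y == [6, 15, 24]; x is untouched

# Safe: the destination aliases x, so reduce over an independent copy of the source
z = view(x, :, 1)
sum!(z, copy(x))    # writes the row sums [6, 15, 24] into the first column of x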
I know that questions about multi-threading performance in Julia have already been asked (e.g. here), but they involve fairly complex code in which many things could be at play.
Here, I am running a very simple loop on multiple threads using Julia v1.5.3 and the speedup doesn't seem to scale up very well when compared to running the same loop with, for instance, Chapel.
I would like to know what I am doing wrong and how I could run multi-threading in Julia more efficiently.
Sequential code
using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 8.034s
Shared memory parallelism with Threads.@threads on 4 threads
using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 6.938s
Speedup: 1.2
Shared memory parallelism with FLoops on 4 threads
using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")
Time: 10.850s
No speedup: slower than the sequential code.
Tests on various numbers of threads (different hardware)
I tested the sequential and Threads.@threads code on a different machine and experimented with various numbers of threads.
Here are the results:
Number of threads    Speedup
2                    1.2
4                    1.2
8                    1.0 (no speedup)
16                   0.9 (the code takes longer to run than the sequential code)
For heavier computations (n = 1e9 in the code above) which would minimize the relative effect of any overhead, the results are very similar:
Number of threads    Speedup
2                    1.1
4                    1.3
8                    1.1
16                   0.8 (the code takes longer to run than the sequential code)
For comparison: same loop with Chapel showing perfect scaling
Code run with Chapel v1.23.0:
use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
}
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
       total, watch.elapsed());
First run (same hardware as the first Julia tests):
Number of threads    Time (s)    Speedup
1                    13.33       n/a
2                     7.34       1.8
Second run (same hardware):
Number of threads    Time (s)    Speedup
1                    13.59       n/a
2                     6.83       2.0
Third run (different hardware):
Number of threads    Time (s)    Speedup
1                    19.99       n/a
2                    10.06       2.0
4                     5.05       4.0
8                     2.54       7.9
16                    1.28       15.6
Someone else can give a much more detailed analysis than me, but the main reason naive Julia threading performs badly here is that the "task" done in each iteration is way too light. Using an atomic lock in this case implies huge overhead, because all threads end up waiting for the lock far too often.
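One way to avoid the shared atomic (a sketch, not from the original answers; it relies on the static scheduling that @threads uses in the Julia 1.5 era, where each iteration stays on its thread) is to give each thread its own accumulator and sum the partial results at the end:

using Base.Threads

function slow_partials(n::Int, digits::String)
    partials = zeros(nthreads())          # one accumulator per thread
    @threads for i in 1:n
        if !occursin(digits, string(i))
            partials[threadid()] += 1.0 / i   # no contention on a shared counter
        end
    end
    "total = $(sum(partials))"
end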
Since your Chapel code is doing a mapreduce, we can also try a parallel mapreduce in Julia:
julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"
julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+, 1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"
This is with 4 threads; I've tested with other numbers of threads and confirmed that the scaling is pretty linear.
By the way, as a general tip, you should try to avoid printing inside benchmarked code: it makes a mess when the code is timed repeatedly, and if your task is fast, STDIO can take a non-negligible amount of time.
As jling suggests in the comments on their answer, the problem here is most likely that the Julia code is allocating lots of memory that needs to be garbage collected. Chapel is, to my understanding, not a garbage-collected language, and that could explain why this example scales more linearly. As a small test of this hypothesis, I benchmarked the following code, which performs the same operations but with a preallocated Vector{UInt8} instead of String:
using BenchmarkTools
using Transducers
using Distributed

function string_vector!(a::Vector{UInt8}, x::Unsigned)
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r
        x = oftype(x, d)
        i -= 1
    end
    a
end

function slow_no_garbage(n::UInt, digits::String)
    digits = collect(codeunits(digits))
    thread_strings = [zeros(UInt8, 100) for _ in 1:Threads.nthreads()]
    fun(i) = if Base._searchindex(string_vector!(thread_strings[Threads.threadid()], i), digits, 1) == 0
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 0x1:n)
    "total = $total"
end

println(@btime slow_no_garbage(UInt(1e8), "9"))
I do not recommend using this code (especially since, because the numbers are always growing in length, I don't properly clear the thread buffer between iterations, although that is easily fixed; one possible fix is sketched below). However, it results in almost linear scaling with the number of threads (see the table at the end of the answer).
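For reference, a sketch of one possible fix for the buffer issue just mentioned (the name string_vector_cleared! is hypothetical and not part of the benchmarked code): wipe the buffer before writing the digits, so that bytes left over from a previous, longer number can never produce a spurious match.

function string_vector_cleared!(a::Vector{UInt8}, x::Unsigned)
    fill!(a, 0x00)                 # clear stale digits from earlier iterations
    n = ndigits(x)
    length(a) < n && error("Vector too short")
    i = n
    @inbounds while i >= 1
        d, r = divrem(x, 0x0a)
        a[i] = 0x30 + r            # ASCII digit
        x = oftype(x, d)
        i -= 1
    end
    a
end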
As jling also mentioned, if a lot of garbage is created, distribution may be better than threading. The following two code snippets use Transducers.jl to run the code, first using threads:
using BenchmarkTools
using Transducers

function slow_transducers(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxt(+, Map(fun), 1:n)
    "total = $total"
end

println(@btime slow_transducers(Int64(1e8), "9"))
and then distributed to separate Julia processes (taking the number of processes as the first command-line argument):
using BenchmarkTools
using Transducers
using Distributed

function slow_distributed(n::Int, digits::String)
    fun(i) = if !occursin(digits, string(i))
        1.0 / i
    else
        0.0
    end
    total = foldxd(+, Map(fun), 1:n)
    "total = $total"
end

addprocs(parse(Int, ARGS[1]))

println(@btime slow_distributed(Int64(1e8), "9"))
The following table shows the results of running all versions with different numbers of threads/processes:
n    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1    4.242 s    4.224 s             4.241 s             2.743 s            7.32 s
2    2.952 s    2.958 s             2.168 s             1.447 s            3.73 s
4    2.135 s    2.147 s             1.163 s             0.716105 s         1.9 s
8    1.742 s    1.741 s             0.859058 s          0.360469 s
Speedup:

n    jling      slow_transducers    slow_distributed    slow_no_garbage    Chapel
1    1.0        1.0                 1.0                 1.0                1.0
2    1.43699    1.42799             1.95618             1.89565            1.96247
4    1.98689    1.9674              3.6466              3.83044            3.85263
8    2.43513    2.42619             4.9368              7.60953
As pointed out by the previous answer, I also found that the performance of multi-threading in Julia is largely influenced by garbage collection.
I used a simple trick: adding GC.gc() before the multi-threaded task to "clean up" the previous garbage. Note: this only works when the memory allocation is not too large.
By the way, you can use GC.enable_logging(true) to get an idea of how long GC takes (it is huge in my code!).
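For illustration, a minimal sketch of the trick (the threaded workload below is just a placeholder, and GC.enable_logging requires Julia 1.8 or newer):

using Base.Threads

GC.enable_logging(true)        # log every collection so you can see how long GC takes
GC.gc()                        # force a collection before the multi-threaded task

results = Vector{Float64}(undef, 1_000)
@threads for i in eachindex(results)
    results[i] = sum(rand(10_000))   # placeholder workload that allocates
end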
In my computer science class we were using Haskell to solve the "queens" problem in which you must find all possible placements of n queens in an nxn board. This was the code we were given:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | partial <- solve (k-1), h <- [0..(n-1)], safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
However, the first time I entered it I accidentally swapped the order in solve k and found that it still gave a correct solution but took much longer:
queens n = solve n
  where
    solve 0 = [ [] ]
    solve k = [ h:partial | h <- [0..(n-1)], partial <- solve (k-1), safe h partial ]
    safe h partial = and [ not (checks h partial i) | i <- [0..(length partial)-1] ]
    checks h partial i = h == partial!!i || abs (h - partial!!i) == i+1
Why does this second version take so much longer? My thought process is that the second version does recursion at every step while the first version does recursion only once and then backtracks. This is not for a homework problem, I'm just curious and feel like it will help me better understand the language.
Simply put,
[ ... | x <- f 42, n <- [1..100] ]
will evaluate f 42 once to a list, and for each element x in such list it will generate all ns from 1 to 100. Instead,
[ ... | n <- [1..100], x <- f 42 ]
will first generate an n from 1 to 100, and for each of them call f 42. So f is now being called 100 times instead of one.
This is no different from what happens in imperative programming when using nested loops:
for x in f(42):           # calls f once
    for n in range(1, 101):
        ...

for n in range(1, 101):
    for x in f(42):       # calls f 100 times
        ...
The fact that your algorithm is recursive makes this swap particularly expensive, since the additional cost factor (100, above) accumulates at each recursive call.
You can also try to bind the result of f 42 to some variable so that it does not need to be recomputed, even if you nest it the other way around:
[ ... | let xs = f 42, n <- [1..100], x <- xs ]
Note that this will keep the whole xs list in memory for the whole loop, preventing it from being garbage collected. Indeed, xs will be fully evaluated for n=1, and then reused for higher values of n.
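To make the difference concrete, here is a small sketch using Debug.Trace (the function f here is made up, and the counts assume compilation without optimizations, since GHC's full-laziness pass may otherwise float f 42 out of the inner generator):

import Debug.Trace (trace)

f :: Int -> [Int]
f x = trace "f evaluated" [x, x + 1]

onceOuter, manyInner, sharedLet :: [Int]
onceOuter = [ x + n | x <- f 42, n <- [1 .. 3] ]                -- "f evaluated" printed once
manyInner = [ x + n | n <- [1 .. 3], x <- f 42 ]                -- printed three times
sharedLet = [ x + n | let xs = f 42, n <- [1 .. 3], x <- xs ]   -- printed once again

main :: IO ()
main = mapM_ print [onceOuter, manyInner, sharedLet]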
My guess is that your first version does a depth-first traversal while your second version does a breadth-first traversal of the tree (see Tree Traversal on Wikipedia).
As the complexity of the problem grows with the size of the board, the second version uses more and more memory to keep track of each level of the tree while the first version quickly forgets the previous branch it visited.
Managing the memory takes a lot of time!
By enabling profiling, you can see how the Haskell runtime behaves with your functions.
If you compare the number of calls, they are strictly the same, but still the second version takes more time:
COST CENTRE     MODULE  no.    entries  %time %alloc  %time %alloc
MAIN            MAIN     44          0    0.0    0.0  100.0  100.0
main            Main     89          0    0.3    0.0    0.3    0.0
CAF             Main     87          0    0.0    0.0   99.7  100.0
main            Main     88          1    0.2    0.6   99.7  100.0
queens2         Main     94          1    0.0    0.0   55.6   48.2
queens2.solve   Main     95         13    3.2    0.8   55.6   48.2
queens2.safe    Main     96   10103868   42.1   47.5   52.3   47.5
queens2.checks  Main    100   37512342   10.2    0.0   10.2    0.0
queens1         Main     90          1    0.0    0.0   43.9   51.1
queens1.solve   Main     91         13    2.0    1.6   43.9   51.1
queens1.safe    Main     92   10103868   29.3   49.5   41.9   49.5
queens1.checks  Main     93   37512342   12.7    0.0   12.7    0.0
Looking at the heap profile tells you what really happens.
The first version has a small and constant heap use, while the second version has a huge heap use which must also face garbage collection (look at the peaks in the heap profile graph).
Looking at the core, the first function generates a single function in core, which is tail recursive (constant stack space - very fast and very nice function. Thanks GHC!). However, the second generates two functions: one to do a single step of the inner loop, and a second function which looks like
loop x = case x of { 0 -> someDefault; _ -> do1 (loop (x-1)) }
This function likely isn't performant because do1 must traverse the entire input list, and each iteration appends new elements to the list (meaning the input list to do1 grows monotonically in length). Whereas the core function for the fast version generates the output list directly, without having to process some other list. It is quite difficult to reason about the performance of list comprehensions, I believe, so let's first translate the functions to not use them:
guard b = if b then [()] else []

solve_good k =
  concatMap (\partial ->
    concatMap (\h ->
      guard (safe h partial) >> return (h:partial)
    ) [0..n-1]
  ) (solve $ k-1)

solve_bad k =
  concatMap (\h ->
    concatMap (\partial ->
      guard (safe h partial) >> return (h:partial)
    ) (solve $ k-1)
  ) [0..n-1]
The transformation is fairly mechanical and is detailed somewhere in the Haskell report, but essentially <- becomes concatMap and conditions become guards. It is much easier to see what is happening now - solve_good makes a recursive call a single time, then concatMaps over that recursively created list. However, solve_bad makes the recursive call inside the outer concatMap, meaning it will potentially (likely) be recomputed for every element in [0..n-1]. Note that there is no semantic reason for solve $ k-1 to be in the inner concatMap - it does not depend on the value that that concatMap binds (the h variable) so it can be safely lifted out above the concatMap which binds h (as is done in solve_good).
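For illustration, here is a sketch (not from the original answer) of solve_bad with the recursive call lifted out of the inner concatMap, as described above; it assumes the same solve, safe, guard, and n as in the snippets before:

solve_lifted k =
  let partials = solve (k - 1)          -- computed once, shared by every h
  in  concatMap (\h ->
        concatMap (\partial ->
          guard (safe h partial) >> return (h:partial)
        ) partials
      ) [0..n-1]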
I'm looking for optimization opportunities in my Haskell program by compiling with -prof, but I don't know how to interpret the cost centres that contain ellipses. What are filter.(...) and jankRoulette.select.(...)?
COST CENTRE                MODULE       %time  %alloc
filter.(...)               Forest        46.5    22.3
set-union                  Forest        22.5     4.1
cache-lookup               Forest        16.0     0.1
removeMany                 MultiMapSet    3.7     1.9
insertMany                 MultiMapSet    3.3     1.8
jankRoulette.select.(...)  Forest         1.4    15.2
I generated that with: $ ghc --make -rtsopts -prof -auto-all main.hs && ./main +RTS -p && cat main.prof
The function filter has a few definitions in a where clause, like this:
filter a b = blahblah
  where
    foo = bar
    bar = baz
    baz = bing
But those all show up as filter.foo, filter.bar, etc.
I thought they might be nested let expressions, but jankRoulette.select doesn't have any. And I've added SCC directives in front of most of them without any of those cost centres rising to the top.
Since most of the time is spent in filter.(...), I'd like to know what that is. :)
TL;DR: GHC generates this when you do a pattern match in a let binding, like let (x,y) = c. The cost of evaluating c is tracked by the (...) cost centre (since there is no unique name for it).
So how did I find this out?
A grep for (...) in the GHC source code finds the following (from compiler/deSugar/Coverage.hs):
-- TODO: Revisit this
addTickLHsBind (L pos (pat@(PatBind { pat_lhs = lhs, pat_rhs = rhs }))) = do
  let name = "(...)"
  (fvs, rhs') <- getFreeVars $ addPathEntry name $ addTickGRHSs False False rhs
  {- ... more code following, but not relevant to this purpose
  -}
That code tells us that it has to do something with pattern bindings.
So we can make a small test program to check the behavior:
x :: Int
(x:_) = reverse [1..1000000000]
main :: IO ()
main = print x
Then, we can run this program with profiling enabled. And indeed, GHC generates the following output:
COST CENTRE  MODULE  no.  entries  %time %alloc  %time %alloc
MAIN         MAIN     42        0    0.0    0.0  100.0  100.0
CAF          Main     83        0    0.0    0.0  100.0  100.0
(...)        Main     86        1  100.0  100.0  100.0  100.0
x            Main     85        1    0.0    0.0    0.0    0.0
main         Main     84        1    0.0    0.0    0.0    0.0
So it turns out the assumption made from the code was correct. All of the time of the program is spent evaluating the reverse [1..1000000000] expression, and it's assigned to the (...) cost centre.
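If you want the expensive expression to show up under a name of your own choosing rather than (...), one option (a sketch, with a hypothetical cost-centre name) is to bind it separately and annotate it with an SCC pragma instead of using a pattern binding:

xs :: [Int]
xs = {-# SCC "reversedList" #-} reverse [1..1000000000]

x :: Int
x = head xs          -- no pattern binding, so no anonymous (...) cost centre

main :: IO ()
main = print x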
Purely Functional Data Structures has the following exercise:
-- 2.5 Sharing can be useful within a single object, not just between objects.
-- For example, if the two subtrees of a given node are identical, then they can
-- be represented by the same tree.
-- Part a: make a `complete a Int` function that creates a tree of
-- depth Int, putting a in every leaf of the tree.
complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0 = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Does this implementation run in O(d) time? Could you please say why or why not?
The interesting part of the code is the complete' function:
complete' d
  | d == 0 = Empty
  | otherwise = let copiedTree = complete' (d-1)
                in Node x copiedTree copiedTree
As Cirdec's answer suggests, we should be careful to analyze each part of the implementation to make sure our assumptions are valid. As a general rule, we can assume that the following take 1 unit of time each*:
Using a data constructor to construct a value (e.g., using Empty to make an empty tree or using Node to turn a value and two trees into a tree).
Pattern matching on a value to see what data constructor it was built from and what values the data constructor was applied to.
Guards and if/then/else expressions (which are implemented internally using pattern matching).
Comparing an Integer to 0.
Cirdec mentions that the operation of subtracting 1 from an Integer is logarithmic in the size of the integer. As they say, this is essentially an artifact of the way Integer is implemented. It is possible to implement integers so that it takes only one step to compare them to 0 and also takes only one step to decrement them by 1. To keep things very general, it's safe to assume that there is some function c such that the cost of decrementing an Integer is c(depth).
Now that we've taken care of those preliminaries, let's get down to work! As is generally the case, we need to set up a system of equations and solve it. Let f(d) be the number of steps needed to calculate complete' d. Then the first equation is very simple:
f(0) = 2
That's because it costs 1 step to compare d to 0, and another step to check that the result is True.
The other equation is the interesting part. Think about what happens when d > 0:
We calculate d == 0.
We check if that is True (it's not).
We calculate d-1 (let's call the result dm1)
We calculate complete' dm1, saving the result as copiedTree.
We apply a Node constructor to x, copiedTree, and copiedTree.
The first part takes 1 step, the second part takes 1 step, the third part takes c(depth) steps, and the fifth part takes 1 step. What about the fourth part? Well, that takes f(d-1) steps, so this will be a recursive definition.
f(0) = 2
f(d) = (3+c(depth)) + f(d-1) when d > 0
OK, now we're cooking with gas! Let's calculate the first few values of f:
f(0) = 2
f(1) = (3+c(depth)) + f(0) = (3+c(depth)) + 2
f(2) = (3+c(depth)) + f(1)
     = (3+c(depth)) + ((3+c(depth)) + 2)
     = 2*(3+c(depth)) + 2
f(3) = (3+c(depth)) + f(2)
     = (3+c(depth)) + (2*(3+c(depth)) + 2)
     = 3*(3+c(depth)) + 2
You should be starting to see a pattern by now:
f(d) = d*(3+c(depth)) + 2
We generally prove things about recursive functions using mathematical induction.
Base case:
The claim holds for d=0 because 0*(3+c(depth))+2=0+2=2=f(0).
Suppose that the claim holds for d=D. Then
f(D+1) = (3+c(depth)) + f(D)
       = (3+c(depth)) + (D*(3+c(depth))+2)
       = (D+1)*(3+c(depth))+2
So the claim holds for D+1 as well. Thus by induction, it holds for all natural numbers d. As a reminder, this gives the conclusion that complete' d takes
f(d) = d*(3+c(depth))+2
time. Now how do we express that in big O terms? Well, big O doesn't care about the constant coefficients of any of the terms, and only cares about the highest-order terms. We can safely assume that c(depth)>=1, so we get
f(d) ∈ O(d*c(depth))
Zooming out to complete, this looks like O(depth*c(depth))
If you use the real cost of Integer decrement, this gives you O(depth*log(depth)). If you pretend that Integer decrement is O(1), this gives you O(depth).
Side note: As you continue to work through Okasaki, you will eventually reach section 10.2.1, where you will see a way to implement natural numbers supporting O(1) decrement and O(1) addition (but not efficient subtraction).
* Haskell's lazy evaluation keeps this from being precisely true, but if you pretend that everything is evaluated strictly, you will get an upper bound for the true value, which will be good enough in this case. If you want to learn how to analyze data structures that use laziness to get good asymptotic bounds, you should keep reading Okasaki.
Theoretical Answer
No, it does not run in O(d) time. Its asymptotic performance is dominated by the Integer subtraction d-1, which takes O(log d) time. This is repeated O(d) times, giving an asymptotic upper bound on time of O(d log d).
This upper bound can improve if you use an Integer representation with an asymptotically optimal O(1) decrement. In practice we don't, since the asymptotically optimal Integer implementations are slower even for unimaginably large values.
In practice, the Integer arithmetic will be a small part of the running time of the program. For practical "large" depths (smaller than a machine word), the program's running time will be dominated by allocating and populating memory. For larger depths you will exhaust the resources of the computer.
Practical Answer
Ask the run time system's profiler.
In order to profile your code, we first need to make sure it is run. Haskell is lazily evaluated, so, unless we do something to cause the tree to be completely evaluated, it might not be. Unfortunately, completely exploring the tree will take O(2^d) steps. We could avoid forcing nodes we had already visited if we kept track of their StableNames. Fortunately, traversing a structure and keeping track of visited nodes by their memory locations is already provided by the data-reify package. Since we will be using it for profiling, we need to install it with profiling enabled (-p).
cabal install -p data-reify
Using Data.Reify requires the TypeFamilies extension and Control.Applicative.
{-# LANGUAGE TypeFamilies #-}
import Data.Reify
import Control.Applicative
We reproduce your Tree code.
data Tree a = Empty | Node a (Tree a) (Tree a)

complete :: a -> Integer -> Maybe (Tree a)
complete x depth
  | depth < 0 = Nothing
  | otherwise = Just $ complete' depth
  where complete' d
          | d == 0 = Empty
          | otherwise = let copiedTree = complete' (d-1)
                        in Node x copiedTree copiedTree
Converting data to a graph with data-reify requires that we have a base functor for the data type. The base functor is a representation of the type with explicit recursion removed. The base functor for Tree is TreeF. An additional type parameter is added to represent recursive occurrences of the type, and each recursive occurrence is replaced by the new parameter.
data TreeF a x = EmptyF | NodeF a x x
  deriving (Show)
The MuRef instance required by reifyGraph requires that we provide a mapDeRef to traverse the structure with an Applicative and convert it to the base functor. The first argument provided to mapDeRef, which I have named deRef, is how we can convert the recursive occurrences of the structure.
instance MuRef (Tree a) where
  type DeRef (Tree a) = TreeF a
  mapDeRef deRef Empty        = pure EmptyF
  mapDeRef deRef (Node a l r) = NodeF a <$> deRef l <*> deRef r
We can make a little program to run to test the complete function. When the graph is small, we'll print it out to see what's going on. When the graph gets big, we'll only print out how many nodes it has.
main = do
  d <- getLine
  let (Just tree) = complete 0 (read d)
  graph@(Graph nodes _) <- reifyGraph tree
  if length nodes < 30
    then print graph
    else print (length nodes)
I put this code in a file named profileSymmetricTree.hs. To compile it, we need to enable profiling with -prof and enable the run-time system with -rtsopts.
ghc -fforce-recomp -O2 -prof -fprof-auto -rtsopts profileSymmetricTree.hs
When we run it, we'll enable the time profile with the +RTS option -p. We'll give it the depth input 3 for the first run.
profileSymmetricTree +RTS -p
3
let [(1,NodeF 0 2 2),(2,NodeF 0 3 3),(3,NodeF 0 4 4),(4,EmptyF)] in 1
We can already see from the graph that the nodes are being shared between the left and right sides of the tree.
The profiler makes a file, profileSymmetricTree.prof.
                                                      individual       inherited
COST CENTRE                    MODULE  no.  entries  %time %alloc    %time %alloc
MAIN                           MAIN     43        0    0.0    0.7    100.0  100.0
main                           Main     87        0  100.0   21.6    100.0   32.5
...
main.(...)                     Main     88        1    0.0    4.8      0.0    5.1
complete                       Main     90        1    0.0    0.0      0.0    0.3
complete.complete'             Main     92        4    0.0    0.2      0.0    0.3
complete.complete'.copiedTree  Main     94        3    0.0    0.1      0.0    0.1
It shows in the entries column that complete.complete' was executed 4 times, and the complete.complete'.copiedTree was evaluated 3 times.
If you repeat this experiment with different depths, and plot the results, you should get a good idea what the practical asymptotic performance of complete is.
Here are the profiling results for a much greater depth, 300000.
                                                      individual       inherited
COST CENTRE                    MODULE  no.  entries  %time %alloc    %time %alloc
MAIN                           MAIN     43        0    0.0    0.0    100.0  100.0
main                           Main     87        0    2.0    0.0     99.9  100.0
...
main.(...)                     Main     88        1    0.0    0.0      2.1    5.6
complete                       Main     90        1    0.0    0.0      2.1    5.6
complete.complete'             Main     92   300001    1.3    4.4      2.1    5.6
complete.complete'.copiedTree  Main     94   300000    0.8    1.3      0.8    1.3