I have an optimization model that turns out to be very slow to build. The model has many if-else conditions and many loops, so I was thinking of using multi-threading to build this single JuMP model object.
A very simplified version of one loop of the code looks like this:
Threads.@threads for g in sets["A"]
    Array_1 = [gg for gg in [sets["B"]; sets["A"]] if data2[gg] == g]
    Array_2 = [gg for gg in sets["B"] if data[gg] == g]
    for t in STAGES
        Array_3 = [gg for gg in [sets["B"]; sets["A"]] if data2[gg] == g && (gg, t) in sets["C"]]
        for b in BLOCKS
            name = @constraint(model, ((g, t, b) in sets["C"] ? X1[(g, t, b)] : 0)
                - sum(X1[(gg, t, b)] for gg in Array_3)
                + X2[(g, t, b)] - sum(X2[(gg, t, b)] for gg in Array_1)
                - sum(data3[gg] for gg in Array_2) == data4[(g, t, b)])
            a = string("con_", g, "_", t, "_", b)
            JuMP.set_name(name, a)
        end
    end
end
I have many of those loops, with many if-else conditions inside, so I added Threads.@threads before the first for g in sets["A"] loop, aiming to reduce the time spent building the model.
The problem is that I get ERROR: LoadError: TaskFailedException: UndefRefError: access to undefined reference when renaming the constraint. Is there any problem with my approach? Without Threads.@threads there is no error at all; the code just runs very slowly.
Some information about the infrastructure:
julia> versioninfo()
Julia Version 1.4.1
Commit 381693d3df* (2020-04-14 17:20 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-8.0.1 (ORCJIT, haswell)
Environment:
JULIA_NUM_THREADS = 40
and packages:
(@v1.4) pkg> status
Status `~/.julia/environments/v1.4/Project.toml`
[c7e460c6] ArgParse v1.1.0
[a076750e] CPLEX v0.6.6
[336ed68f] CSV v0.7.7
[e2554f3b] Clp v0.8.1
[a93c6f00] DataFrames v0.21.7
[5789e2e9] FileIO v1.4.3
[2e9cd046] Gurobi v0.8.1
[033835bb] JLD2 v0.2.1
[4076af6c] JuMP v0.21.5
[438e738f] PyCall v1.91.4
[2913bbd2] StatsBase v0.33.1
[bd369af6] Tables v1.0.5
[6dd1b50a] Tulip v0.6.2
[1a1011a3] SharedArrays
[10745b16] Statistics
Thanks in advance!
Full stacktrace:
ERROR: LoadError: TaskFailedException:
UndefRefError: access to undefined reference
Stacktrace:
[1] getindex at ./array.jl:788 [inlined]
[2] ht_keyindex2!(::Dict{MathOptInterface.ConstraintIndex,String}, ::MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64},MathOptInterface.EqualTo{Float64}}) at ./dict.jl:326
[3] setindex!(::Dict{MathOptInterface.ConstraintIndex,String}, ::String, ::MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64},MathOptInterface.EqualTo{Float64}}) at ./dict.jl:381
[4] set at /home/user/.julia/packages/MathOptInterface/k7UUH/src/Utilities/model.jl:349 [inlined]
[5] set at /home/user/.julia/packages/MathOptInterface/k7UUH/src/Utilities/universalfallback.jl:354 [inlined]
[6] set(::MathOptInterface.Utilities.CachingOptimizer{MathOptInterface.AbstractOptimizer,MathOptInterface.Utilities.UniversalFallback{MathOptInterface.Utilities.Model{Float64}}}, ::MathOptInterface.ConstraintName, ::MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64},MathOptInterface.EqualTo{Float64}}, ::String) at /home/user/.julia/packages/MathOptInterface/k7UUH/src/Utilities/cachingoptimizer.jl:646
[7] set(::Model, ::MathOptInterface.ConstraintName, ::ConstraintRef{Model,MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64},MathOptInterface.EqualTo{Float64}},ScalarShape}, ::String) at /home/user/.julia/packages/JuMP/qhoVb/src/JuMP.jl:903
[8] set_name(::ConstraintRef{Model,MathOptInterface.ConstraintIndex{MathOptInterface.ScalarAffineFunction{Float64},MathOptInterface.EqualTo{Float64}},ScalarShape}, ::String) at /home/user/.julia/packages/JuMP/qhoVb/src/constraints.jl:68
[9] macro expansion at /home/user/code/model_formulation.jl:117 [inlined]
[10] (::var"#20#threadsfor_fun#255"{Dict{Any,Any},Dict{Any,Any},JuMP.Containers.DenseAxisArray{VariableRef,1,Tuple{Array{Tuple{String,Int64,Int64},1}},Tuple{Dict{Tuple{String,Int64,Int64},Int64}}},JuMP.Containers.DenseAxisArray{VariableRef,1,Tuple{Array{Tuple{String,Int64,Int64},1}},Tuple{Dict{Tuple{String,Int64,Int64},Int64}}},Array{String,1}})(::Bool) at ./threadingconstructs.jl:61
[11] (::var"#20#threadsfor_fun#255"{Dict{Any,Any},Dict{Any,Any},JuMP.Containers.DenseAxisArray{VariableRef,1,Tuple{Array{Tuple{String,Int64,Int64},1}},Tuple{Dict{Tuple{String,Int64,Int64},Int64}}},JuMP.Containers.DenseAxisArray{VariableRef,1,Tuple{Array{Tuple{String,Int64,Int64},1}},Tuple{Dict{Tuple{String,Int64,Int64},Int64}}},Array{String,1}})() at ./threadingconstructs.jl:28
Stacktrace:
[1] wait(::Task) at ./task.jl:267
[2] macro expansion at ./threadingconstructs.jl:69 [inlined]
[3] model_formulation(::Dict{Any,Any}, ::Dict{Any,Any}, ::Dict{Any,Any}, ::Dict{String,Bool}, ::String) at /home/user/code/model_formulation.jl:102
[4] functionA(::Dict{Any,Any}, ::Dict{Any,Any}, ::Dict{Any,Any}, ::String, ::Dict{String,Bool}) at /home/user/code/functionA.jl:178
[5] top-level scope at /home/user/code/main.jl:81
[6] include(::Module, ::String) at ./Base.jl:377
[7] exec_options(::Base.JLOptions) at ./client.jl:288
[8] _start() at ./client.jl:484
in expression starting at /home/user/code/main.jl:81
You have two options to parallelize a JuMP optimization model:
1. Run a multi-threaded version of the solver (provided that the solver supports it) - in that case the parallelism is fully handled by the external solver library and your Julia process remains single-threaded.
2. Run several single-threaded solver processes in parallel threads controlled by Julia. In this case several copies of the model need to be created separately, which you can then try to send to the solvers at the same time.
#1:
Solvers expose parameters, including multi-threading control (on the other hand, they might simply be using all available threads by default). Here is an example with Gurobi:
using JuMP, Gurobi
m = Model(optimizer_with_attributes(Gurobi.Optimizer, "Threads" => 2))
@variable(m, 0 <= x <= 2)
@variable(m, 0 <= y <= 30)
@objective(m, Max, 5x + 3 * y)
@constraint(m, con, 1x + 5y <= 3)
optimize!(m) # the model will be optimized using 2 threads
#2:
To run many solver copies in parallel you need to have separate model copies. In my code they differ by the range of the x variable:
Threads.@threads for z in 1:4
    m = Model(optimizer_with_attributes(Gurobi.Optimizer, "Threads" => 1))
    @variable(m, 0 <= x <= z)
    @variable(m, 0 <= y <= 30)
    @objective(m, Max, 5x + 3 * y)
    @constraint(m, con, 1x + 5y <= 3)
    optimize!(m)
    # TODO: collect results
end
These are two separate approaches and you cannot mix them. If you parallelize execution, each thread needs to get its own model copy, because JuMP mutates the Model object.
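A further note on the original build-time problem (this is only a sketch, not part of the answer above): since only mutating the Model is unsafe across threads, one pattern is to precompute the expensive index filtering in parallel and keep all @constraint / set_name calls on a single thread, reusing the data structures from the question:
A = collect(sets["A"])
precomputed = Vector{Any}(undef, length(A))
Threads.@threads for i in eachindex(A)
    g = A[i]
    # thread-safe: only reads from the input data, no Model access here
    precomputed[i] = (g,
        [gg for gg in [sets["B"]; sets["A"]] if data2[gg] == g],  # Array_1
        [gg for gg in sets["B"] if data[gg] == g])                # Array_2
end
for (g, Array_1, Array_2) in precomputed
    # build the constraints exactly as before, but single-threaded,
    # so the JuMP Model is only ever touched by one thread
end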
Related
In Python, it is usually suggested to vectorize the code to make computation faster. For example, if you want to compute the inner product of two vectors, say a and b, usually
Code A
c = np.dot(a, b)
is better than
Code B
c = 0
for i in range(len(a)):
c += a[i] * b[i]
But in Julia, it seems that vectorization is sometimes not that helpful. I treated '* and dot as vectorized versions and an explicit for loop as the non-vectorized version, and got the following result.
using Random
using LinearAlgebra
len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)
function inner_product(a, b)
    result = 0
    for i in 1:length(a)
        result += a[i] * b[i]
    end
    return result
end

@time a' * b
@time dot(a, b)
@time inner_product(a, b);
0.013442 seconds (56.76 k allocations: 3.167 MiB)
0.003484 seconds (106 allocations: 6.688 KiB)
0.008489 seconds (17.52 k allocations: 976.752 KiB)
(I know using BenchmarkTools.jl is a more standard way to measure the performance.)
From the output, dot runs faster than the for loop, which in turn runs faster than '*, which contradicts what I had presumed.
So my question is,
does Julia need (or sometimes need) vectorization to speed up computation?
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
You are not running the benchmarks correctly, and the implementation of your function is suboptimal.
julia> using BenchmarkTools
julia> @btime $a' * $b
429.400 μs (0 allocations: 0 bytes)
249985.3680190253
julia> @btime dot($a,$b)
426.299 μs (0 allocations: 0 bytes)
249985.3680190253
julia> @btime inner_product($a, $b)
970.500 μs (0 allocations: 0 bytes)
249985.36801903677
The correct implementation:
function inner_product_correct(a, b)
    result = 0.0  # use the same type as the elements of the arguments
    @simd for i in 1:length(a)
        @inbounds result += a[i] * b[i]
    end
    return result
end
julia> @btime inner_product_correct($a, $b)
530.499 μs (0 allocations: 0 bytes)
249985.36801902478
There is still a difference (though a less significant one) because dot uses the optimized BLAS implementation, which is multi-threaded. You could (following Bogumil's comment) set OPENBLAS_NUM_THREADS=1, and then you will find that the time of the BLAS function is essentially identical to that of the Julia implementation.
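If you want to check that on your machine, one way is either to export the environment variable before starting Julia, or to limit the BLAS thread count from within the session:
using LinearAlgebra
# limit BLAS to one thread so dot / a' * b are compared with the loop on equal footing
# (equivalently: export OPENBLAS_NUM_THREADS=1 before starting Julia)
BLAS.set_num_threads(1)
@btime dot($a, $b)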
Note also that working with floating-point numbers is tricky in many ways:
julia> inner_product_correct(a, b)==dot(a,b)
false
julia> inner_product_correct(a, b) ≈ dot(a,b)
true
Finally, in Julia, deciding whether to use vectorization or to write the loop yourself is up to you - there is no performance penalty (as long as you write type-stable code and use @simd and @inbounds where required). However, in your code you were not testing vectorization; you were comparing a BLAS call against writing the loop yourself. Here is the must-read to understand what is going on: https://docs.julialang.org/en/v1/manual/performance-tips/
Let me add my practical experience as a comment (too long for a standard comment):
does Julia need (or sometimes need) vectorization to speed up computation?
Julia does not need vectorization the way Python does (see the answer by Przemysław), but in practice, if there is a well-written vectorized function available (like dot), then use it: while possible, it can be tricky to write an equally performant function yourself (people have probably spent days optimizing dot, especially to make it use multiple threads optimally).
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
When you use vectorized code, it all depends on the implementation of the function you want to use. In this case dot(a, b) and a' * b are exactly the same: running @edit a' * b gives you, in this case,
*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)
and you see it is the same.
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
Julia is a compiled language, while Python is an interpreted language. In some cases the Python interpreter can provide fast execution, but in other cases it currently cannot (which does not mean that it will not improve in the future). In particular, vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases, as both just call this compiled function. However, when you use loops (non-vectorized code), Python will currently be slower than Julia.
I'm trying to solve a Constraint Satisfaction Optimisation Problem that assigns agents to tasks. However, unlike the basic Assignment Problem, an agent can be assigned to many tasks if the tasks do not overlap.
Each task has a fixed start_time and end_time.
The agents are assigned to the tasks according to some unary and binary constraints.
Variables = set of tasks
Domain = set of compatible agents (for each variable)
Constraints = unary & binary
Optimisation function = some linear function
An example of the problem: the allocation of parking space (or teams) for trucks for which we know the arrival and departure time.
I'm interested in whether there is a precise name in the literature for this type of problem. I presume it is some kind of assignment problem. Also, if you have ever approached this problem, how did you solve it?
Thank you.
I would interpret this as: rectangular assignment-problem with conflicts
which is arguably much harder (NP-hard in general) than the polynomially solvable assignment problem.
The demo shown in the other answer might work and ortools' CP-SAT is great, but I don't see a good reason to use discrete-time-based reasoning here the way it is done there: interval variables, edge-finding and related scheduling constraints (plus conflict analysis / explanations). This machinery is overkill and the overhead will be noticeable. I don't see any need to reason about time itself, only about time-induced conflicts.
Edit: One could label those two approaches (linked + proposed) as compact formulation and extended formulation. Extended formulations usually show stronger relaxations and better (solving) results as long as scalability is not an issue. Compact approaches might become more viable again with bigger data (but it's hard to guess here, as scheduling propagators are not that cheap).
What I would propose:
(1) Formulate an integer-programming model following the basic assignment-problem formulation + adaptations to make it rectangular -> a worker is allowed to tackle multiple tasks while every task is tackled (one sum-equality dropped)
see wiki
(2) Add integrality = mark the variables as binary -> needed because the problem no longer satisfies total unimodularity
(3) Add constraints to forbid conflicts
(4) Add constraints for the remaining stuff (e.g. compatibility)
Now this is all straightforward, but I would propose one non-naive improvement in regards to (3):
The conflicts can be interpreted as a stable-set polytope
Your conflicts are induced by a-priori defined time-windows and their overlaps (as I interpret it; this is the core assumption behind this whole answer)
This is an interval graph (because of the time-windows)
All interval graphs are chordal
Chordal graphs allow enumeration of all max-cliques in poly-time (implying there are only polynomially many)
python: networkx.algorithms.chordal.chordal_graph_cliques
The set (enumeration) of all maximal cliques defines the facets of the stable-set polytope
We add those as constraints (one constraint for each element of the set)!
(The stable-set polytope on the graph in use here would also allow very powerful semidefinite relaxations, but it's hard to foresee in which cases this would actually help, because SDPs are much harder to work with: warm-starting within tree-search; scalability; ...)
This will lead to a polynomial-size integer-programming problem which should be very good when using a good IP solver (commercial ones, or if open source is needed: Cbc > GLPK).
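Putting (1)-(4) together, here is a minimal sketch of such an integer program. It uses the PuLP modelling package and made-up costs/compatibility purely for illustration; the clique list is the kind of output produced by the demo below:
import pulp

workers, tasks = range(3), range(6)
cost = {(w, t): 1 for w in workers for t in tasks}   # placeholder objective data
cliques = [[1, 2, 4], [0, 1, 2], [3, 5]]             # max-cliques from step (3)

prob = pulp.LpProblem("rectangular_assignment", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (workers, tasks), cat="Binary")   # (2) binary variables

# objective
prob += pulp.lpSum(cost[w, t] * x[w][t] for w in workers for t in tasks)

# (1) every task gets exactly one worker; a worker may take several tasks
for t in tasks:
    prob += pulp.lpSum(x[w][t] for w in workers) == 1

# (3) one constraint per worker and per maximal clique of conflicting tasks
for w in workers:
    for clique in cliques:
        prob += pulp.lpSum(x[w][t] for t in clique) <= 1

# (4) compatibility constraints, e.g. worker 0 cannot do task 3
prob += x[0][3] == 0

prob.solve()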
Small demo about (3)
import itertools
import networkx as nx

# data: inclusive, exclusive
# --------------------------
time_windows = [
    (2, 7),
    (0, 10),
    (6, 12),
    (12, 20),
    (8, 12),
    (16, 20)
]

# helper
# ------
def is_overlapping(a, b):
    return (b[1] > a[0] and b[0] < a[1])

# raw conflicts
# -------------
binary_conflicts = []
for a, b in itertools.combinations(range(len(time_windows)), 2):
    if is_overlapping(time_windows[a], time_windows[b]):
        binary_conflicts.append((a, b))

# conflict graph
# --------------
G = nx.Graph()
G.add_edges_from(binary_conflicts)

# maximal cliques
# ---------------
max_cliques = nx.chordal_graph_cliques(G)

print('naive constraints: raw binary conflicts')
for i in binary_conflicts:
    print('sum({}) <= 1'.format(i))

print('improved constraints: clique-constraints')
for i in max_cliques:
    print('sum({}) <= 1'.format(list(i)))
Output:
naive constraints: raw binary conflicts
sum((0, 1)) <= 1
sum((0, 2)) <= 1
sum((1, 2)) <= 1
sum((1, 4)) <= 1
sum((2, 4)) <= 1
sum((3, 5)) <= 1
improved constraints: clique-constraints
sum([1, 2, 4]) <= 1
sum([0, 1, 2]) <= 1
sum([3, 5]) <= 1
Fun facts:
Commercial integer-programming solvers, and maybe even Cbc, might try to do the same reasoning about clique constraints to some degree, although without the chordality assumption, where it is an NP-hard problem
ortools' CP-SAT solver also has a code path for this (again: the general NP-hard case)
This should trigger when expressing the conflict-based model (it is much harder for the solver to decide on this exploitation in general discrete-time-based scheduling models)
Caveats
Implementation / Scalability
There are still open questions like:
duplicating max-clique constraints over each worker vs. merging them somehow
be more efficient/clever in finding conflicts (sorting)
will it scale to the data: how big will the graph be / how many conflicts and constraints from those do we need
But those things usually follow instance-statistics (aka "don't decide blindly").
I don't know a name for the specific variant you're describing - maybe others would. However, this indeed seems a good fit for a CP/MIP solver; I would go with the OR-Tools CP-SAT solver, which is free, flexible and usually works well.
Here's a reference implementation with Python, assuming each vehicle requires a team assigned to it with no overlaps, and that the goal is to minimize the number of teams in use.
The framework allows you to directly model allowed / forbidden assignments (check out the docs).
from ortools.sat.python import cp_model
model = cp_model.CpModel()
## Data
num_vehicles = 20
max_teams = 10
# Generate some (semi-)interesting data
interval_starts = [i % 9 for i in range(num_vehicles)]
interval_len = [ (num_vehicles - i) % 6 for i in range(num_vehicles)]
interval_ends = [ interval_starts[i] + interval_len[i] for i in range(num_vehicles)]
### variables
# t, v is true iff vehicle v is served by team t
team_assignments = {(t, v): model.NewBoolVar("team_assignments_%i_%i" % (t, v)) for t in range(max_teams) for v in range(num_vehicles)}
#intervals for vehicles. Each interval can be active or non active, according to team_assignments
vehicle_intervals = {(t, v): model.NewOptionalIntervalVar(interval_starts[v], interval_len[v], interval_ends[v], team_assignments[t, v], 'vehicle_intervals_%i_%i' % (t, v))
for t in range(max_teams) for v in range(num_vehicles)}
team_in_use = [model.NewBoolVar('team_in_use_%i' % (t)) for t in range(max_teams)]
## constraints
# non overlap for each team
for t in range(max_teams):
    model.AddNoOverlap([vehicle_intervals[t, v] for v in range(num_vehicles)])
# each vehicle must be served by exactly one team
for v in range(num_vehicles):
    model.Add(sum(team_assignments[t, v] for t in range(max_teams)) == 1)
# what teams are in use?
for t in range(max_teams):
    model.AddMaxEquality(team_in_use[t], [team_assignments[t, v] for v in range(num_vehicles)])
# symmetry breaking - use teams in-order
for t in range(max_teams-1):
    model.AddImplication(team_in_use[t].Not(), team_in_use[t+1].Not())
# let's say that the goal is to minimize the number of teams required
model.Minimize(sum(team_in_use))
solver = cp_model.CpSolver()
# optional
# solver.parameters.log_search_progress = True
# solver.parameters.num_search_workers = 8
# solver.parameters.max_time_in_seconds = 5
result_status = solver.Solve(model)
if (result_status == cp_model.INFEASIBLE):
    print('No feasible solution under constraints')
elif (result_status == cp_model.OPTIMAL):
    print('Optimal result found, required teams=%i' % (solver.ObjectiveValue()))
elif (result_status == cp_model.FEASIBLE):
    print('Feasible (non optimal) result found')
else:
    print('No feasible solution found under constraints within time')
# Output:
#
# Optimal result found, required teams=7
EDIT:
@sascha suggested a beautiful approach for analyzing the (known-in-advance) time-window overlaps, which would make this solvable as an assignment problem.
So while the formulation above might not be the optimal one for this (although it could be, depending on how the solver works), I've tried to replace the no-overlap conditions with the max-clique approach suggested - full code below.
I did some experiments with moderately large problems (100 and 300 vehicles), and it seems empirically that on the smaller problems (~100) this does improve things somewhat - about 15% on average on the time to the optimal solution; but I could not find a significant improvement on the larger (~300) problems. This might be because my formulation is not optimal; because the CP-SAT solver (which is also a good IP solver) is smart enough; or because there's something I've missed :)
Code:
(this is basically the same code from above, with the logic to support using the network approach instead of the no-overlap one copied from @sascha's answer):
import itertools
import networkx as nx
from timeit import default_timer as timer
from ortools.sat.python import cp_model
model = cp_model.CpModel()
run_start_time = timer()
## Data
num_vehicles = 300
max_teams = 300
USE_MAX_CLIQUES = True
# Generate some (semi-)interesting data
interval_starts = [i % 9 for i in range(num_vehicles)]
interval_len = [ (num_vehicles - i) % 6 for i in range(num_vehicles)]
interval_ends = [ interval_starts[i] + interval_len[i] for i in range(num_vehicles)]
if (USE_MAX_CLIQUES):
    ## Max-cliques analysis
    # for the max-clique approach
    time_windows = [(interval_starts[i], interval_ends[i]) for i in range(num_vehicles)]

    def is_overlapping(a, b):
        return (b[1] > a[0] and b[0] < a[1])

    # raw conflicts
    # -------------
    binary_conflicts = []
    for a, b in itertools.combinations(range(len(time_windows)), 2):
        if is_overlapping(time_windows[a], time_windows[b]):
            binary_conflicts.append((a, b))

    # conflict graph
    # --------------
    G = nx.Graph()
    G.add_edges_from(binary_conflicts)

    # maximal cliques
    # ---------------
    max_cliques = nx.chordal_graph_cliques(G)
##
### variables
# t, v is true iff point vehicle v is served by team t
team_assignments = {(t, v): model.NewBoolVar("team_assignments_%i_%i" % (t, v)) for t in range(max_teams) for v in range(num_vehicles)}
#intervals for vehicles. Each interval can be active or non active, according to team_assignments
vehicle_intervals = {(t, v): model.NewOptionalIntervalVar(interval_starts[v], interval_len[v], interval_ends[v], team_assignments[t, v], 'vehicle_intervals_%i_%i' % (t, v))
for t in range(max_teams) for v in range(num_vehicles)}
team_in_use = [model.NewBoolVar('team_in_use_%i' % (t)) for t in range(max_teams)]
## constraints
# non overlap for each team
if (USE_MAX_CLIQUES):
    overlap_constraints = [list(l) for l in max_cliques]
    for t in range(max_teams):
        for l in overlap_constraints:
            model.Add(sum(team_assignments[t, v] for v in l) <= 1)
else:
    for t in range(max_teams):
        model.AddNoOverlap([vehicle_intervals[t, v] for v in range(num_vehicles)])
# each vehicle must be served by exactly one team
for v in range(num_vehicles):
    model.Add(sum(team_assignments[t, v] for t in range(max_teams)) == 1)
# what teams are in use?
for t in range(max_teams):
    model.AddMaxEquality(team_in_use[t], [team_assignments[t, v] for v in range(num_vehicles)])
# symmetry breaking - use teams in-order
for t in range(max_teams-1):
    model.AddImplication(team_in_use[t].Not(), team_in_use[t+1].Not())
# let's say that the goal is to minimize the number of teams required
model.Minimize(sum(team_in_use))
solver = cp_model.CpSolver()
# optional
solver.parameters.log_search_progress = True
solver.parameters.num_search_workers = 8
solver.parameters.max_time_in_seconds = 120
result_status = solver.Solve(model)
if (result_status == cp_model.INFEASIBLE):
    print('No feasible solution under constraints')
elif (result_status == cp_model.OPTIMAL):
    print('Optimal result found, required teams=%i' % (solver.ObjectiveValue()))
elif (result_status == cp_model.FEASIBLE):
    print('Feasible (non optimal) result found, required teams=%i' % (solver.ObjectiveValue()))
else:
    print('No feasible solution found under constraints within time')
print('run time: %.2f sec ' % (timer() - run_start_time))
I'm trying the multithreading functionality of Julia 1.3 with the following hardware:
Model Name: MacBook Pro
Processor Name: Intel Core i7
Processor Speed: 2.8 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
When running the following script:
function F(n)
    if n < 2
        return n
    else
        return F(n-1) + F(n-2)
    end
end

@time F(43)
it gives me the following output
2.229305 seconds (2.00 k allocations: 103.924 KiB)
433494437
However when running the following code copied from the Julia page about multithreading
import Base.Threads.@spawn

function fib(n::Int)
    if n < 2
        return n
    end
    t = @spawn fib(n - 2)
    return fib(n - 1) + fetch(t)
end
fib(43)
what happens is that the utilisation of RAM/CPU jumps from 3.2GB/6% to 15GB/25%, without any output (for at least 1 minute, after which I decided to kill the Julia session).
What am I doing wrong?
Great question.
This multithreaded implementation of the Fibonacci function is not faster than the single-threaded version. That function was only shown in the blog post as a toy example of how the new threading capabilities work, highlighting that they allow spawning many, many threads in different functions and letting the scheduler figure out an optimal workload.
The problem is that @spawn has a non-trivial overhead of around 1µs, so if you spawn a thread to do a task that takes less than 1µs, you've probably hurt your performance. The recursive definition of fib(n) has exponential time complexity of order 1.6180^n [1], so when you call fib(43), you spawn something of order 1.6180^43 threads. If each one takes 1µs to spawn, it'll take around 16 minutes just to spawn and schedule the needed threads, and that doesn't even account for the time it takes to do the actual computations and re-merge / sync threads which takes even more time.
Things like this where you spawn a thread for each step of a computation only make sense if each step of the computation takes a long time compared to the @spawn overhead.
Note that there is work going into lessening the overhead of @spawn, but by the very physics of multicore silicon chips I doubt it can ever be fast enough for the above fib implementation.
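If you want to measure that overhead on your own machine, something like the following gives a rough number (timings are machine- and version-dependent):
using BenchmarkTools
import Base.Threads.@spawn

# roughly the cost of spawning, scheduling and fetching one trivial task
@btime fetch(@spawn 1 + 1)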
If you're curious about how we could modify the threaded fib function to actually be beneficial, the easiest thing to do would be to only spawn a fib thread if we think it will take significantly longer than 1µs to run. On my machine (running on 16 physical cores), I get
function F(n)
if n < 2
return n
else
return F(n-1)+F(n-2)
end
end
julia> @btime F(23);
122.920 μs (0 allocations: 0 bytes)
so that's a good two orders of magnitude over the cost of spawning a thread. That seems like a good cutoff to use:
function fib(n::Int)
    if n < 2
        return n
    elseif n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return fib(n-1) + fib(n-2)
    end
end
Now, if I follow proper benchmarking methodology with BenchmarkTools.jl [2], I find
julia> using BenchmarkTools
julia> @btime fib(43)
971.842 ms (1496518 allocations: 33.64 MiB)
433494437
julia> @btime F(43)
1.866 s (0 allocations: 0 bytes)
433494437
@Anush asks in the comments: this is a factor-of-2 speed-up using 16 cores, it seems. Is it possible to get something closer to a factor-of-16 speed-up?
Yes, it is. The problem with the above function is that its body is larger than that of F, with lots of conditionals, function / thread spawning, and all that. I invite you to compare @code_llvm F(10) and @code_llvm fib(10). This means that fib is much harder for Julia to optimize. This extra overhead makes a world of difference for the small-n cases.
julia> @btime F(20);
28.844 μs (0 allocations: 0 bytes)
julia> @btime fib(20);
242.208 μs (20 allocations: 320 bytes)
Oh no! All that extra code that never gets touched for n < 23 is slowing us down by an order of magnitude! There's an easy fix though: when n < 23, don't recurse down to fib; instead, call the single-threaded F.
function fib(n::Int)
    if n > 23
        t = @spawn fib(n - 2)
        return fib(n - 1) + fetch(t)
    else
        return F(n)
    end
end
julia> @btime fib(43)
138.876 ms (185594 allocations: 13.64 MiB)
433494437
which gives a result closer to what we'd expect for so many threads.
[1] https://www.geeksforgeeks.org/time-complexity-recursive-fibonacci-program/
[2] The @btime macro from BenchmarkTools.jl will run the function multiple times, skipping the compilation time, and aggregate the results.
@Anush
As an example of using memoization and multithreading manually:
_fib(::Val{1}, _, _) = 1
_fib(::Val{2}, _, _) = 1

import Base.Threads.@spawn

_fib(x::Val{n}, d = zeros(Int, n), channel = Channel{Bool}(1)) where n = begin
    # lock the channel
    put!(channel, true)
    if d[n] != 0
        res = d[n]
        take!(channel)
    else
        take!(channel) # unlock channel so I can compute stuff
        # t = @spawn _fib(Val(n-2), d, channel)
        t1 = _fib(Val(n-2), d, channel)
        t2 = _fib(Val(n-1), d, channel)
        res = fetch(t1) + fetch(t2)
        put!(channel, true) # lock channel
        d[n] = res
        take!(channel) # unlock channel
    end
    return res
end
fib(n) = _fib(Val(n), zeros(Int, n), Channel{Bool}(1))
fib(1)
fib(2)
fib(3)
fib(4)
@time fib(43)
using BenchmarkTools
@benchmark fib(43)
But the speed-up came from memoization, not so much from multithreading. The lesson here is that we should think about better algorithms before reaching for multithreading.
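For comparison, a minimal single-threaded memoized version (the name fib_memo is ours, not from the answer) makes the same point: the dictionary does the work, no threads involved.
# plain memoization, no threads: a Dict caches already-computed values,
# so fib_memo(43) performs only O(n) additions
function fib_memo(n::Int, memo = Dict{Int,Int}())
    n < 2 && return n
    get!(memo, n) do
        fib_memo(n - 1, memo) + fib_memo(n - 2, memo)
    end
end

@time fib_memo(43)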
I'm writing an ETL process to read event-level data from a product database, transform / aggregate it, and write it to an analytics data warehouse. I'm using Clojure's core.async library to separate these processes into concurrently executing components. Here's what the main part of my code looks like right now:
(ns data-staging.main
  (:require [clojure.core.async :as async])
  (:use [clojure.core.match :only (match)]
        [data-staging.map-vecs]
        [data-staging.tables])
  (:gen-class))
(def submissions (make-table "Submission" "Valid"))
(def photos (make-table "Photo"))
(def videos (make-table "Video"))
(def votes (make-table "Votes"))
;; define channels used for sequential data processing
(def chan-in (async/chan 100))
(def chan-out (async/chan 100))
(defn write-thread [table]
  "infinitely loops between reading subsequent 10000 rows from
   table and outputting a vector of the rows (maps)
   into 'chan-in'"
  (while true
    (let [next-rows (get-rows table)]
      (async/>!! chan-in next-rows)
      (set-max table (:max-id (last next-rows))))))
(defn aggregator []
  "takes output from 'chan-in' and aggregates it by coupon_id, date.
   then adds / drops any fields that are needed / not needed and inputs
   into 'chan-out'"
  (while true
    (->>
      (async/<!! chan-in)
      aggregate
      (async/>!! chan-out))))
(defn read-thread []
  "reads data from 'chan-out' and inserts it into the Analytics DB"
  (while true
    (upsert (async/<!! chan-out))))
(defn -main []
  (async/thread (write-thread submissions))
  (async/thread (write-thread photos))
  (async/thread (write-thread videos))
  (async/thread-call aggregator)
  (async/thread-call read-thread))
As you can see, I'm putting each of these components onto its own thread and using the blocking >!! call on the channels. It feels like using the non-blocking >! calls along with go blocks might be better for this use case, especially for the database reads, which spend most of their time performing I/O and waiting for new rows in the product DB. Is this the case, and if so, what would be the best way to implement it? I'm a little unclear on all the tradeoffs between the two methods and on exactly how to use go blocks effectively. Any other suggestions on how to improve the overall architecture would also be much appreciated!
Personally, I think your use of threads here is probably the right call. The magic non-blocking nature of go-blocks comes from "parking," which is a special sort of pseudo-blocking that core.async's state machine uses — but since your database calls genuinely block instead of putting the state machine into a parked state, you'd just be blocking some thread from the core.async thread pool. It does depend on how long your synchronous calls take, so this is the sort of thing where benchmarks can be informative, but I strongly suspect threads are the right approach here.
The one exception is your aggregator function. It looks to me like it could just be folded into the definition of chan-out, as (def chan-out (map< aggregate chan-in)).
For a general overview of go-blocks versus threads, Martin Trojer wrote a good examination of the two approaches and which one is faster in which situation. The Cliff's Notes version is that go-blocks are good for adapting already-asynchronous libraries for use with core.async, while threads are good for making asynchronous processes out of synchronous parts. If your database had a callback-based API, for example, then go-blocks would be a definite win. But since it is synchronous, they are not a good fit.
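For illustration only, here is a rough sketch of that callback case (async-query and product-db are hypothetical names, not part of the code above): the callback just puts the result onto a channel, and a go-block can then park on that channel instead of tying up a thread.
;; `async-query` stands in for a hypothetical callback-based DB client; the
;; callback delivers rows onto a channel, and consumers park on it in a go-block
(defn rows-chan [db query]
  (let [c (async/chan)]
    (async-query db query (fn [rows] (async/put! c rows)))
    c))

(async/go
  (async/>! chan-in (async/<! (rows-chan product-db "SELECT ..."))))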
I think it would be a better approach to use "go" macros to get non-blocking threads in this ETL case.
I've written some very simple code to achieve the synchronized sequence of processes implied by Extract, Transform and Load tasks.
Type the following code into your REPL:
(require '[clojure.core.async :as async :refer [<! >! <!! timeout chan alt! go]])

(def output (chan))

(defn extract [origin]
  (let [value-extracted (chan)
        value-transformed (chan)
        value-loaded (chan)]
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-extracted (str origin " > extracted ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-transformed (str (<! value-extracted) " > transformed ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-loaded (str (<! value-transformed) " > loaded ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! output [origin (<! value-loaded)]))))

(go
  (loop [origins-already-loaded []]
    (let [[id message] (<! output)
          origins-updated (conj origins-already-loaded id)]
      (println message)
      (println origins-updated)
      (recur origins-updated))))
Then type in the REPL:
(doseq [example (take 10 (range))] (extract example))
1 > extracted > transformed > loaded
[1]
7 > extracted > transformed > loaded
[1 7]
0 > extracted > transformed > loaded
[1 7 0]
8 > extracted > transformed > loaded
[1 7 0 8]
3 > extracted > transformed > loaded
[1 7 0 8 3]
6 > extracted > transformed > loaded
[1 7 0 8 3 6]
2 > extracted > transformed > loaded
[1 7 0 8 3 6 2]
5 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5]
9 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9]
4 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9 4]
UPDATE:
The error that was fixed: a (now removed) helper function "wait-a-while" used the blocking take <!! (timeout (+ 100 (* 100 (rand-int 20)))), which was blocking the other, non-blocking go processes.
In mathematical languages (R, in this example), you can create a vector as follows:
x = seq(0, 2*pi, length.out = 100)
This outputs:
[1] 0.00000000 0.06346652 0.12693304 0.19039955 0.25386607 0.31733259 0.38079911
[8] 0.44426563 0.50773215 0.57119866 0.63466518 0.69813170 0.76159822 0.82506474
[15] 0.88853126 0.95199777 1.01546429 1.07893081 1.14239733 1.20586385 1.26933037
[22] 1.33279688 1.39626340 1.45972992 1.52319644 1.58666296 1.65012947 1.71359599
[29] 1.77706251 1.84052903 1.90399555 1.96746207 2.03092858 2.09439510 2.15786162
[36] 2.22132814 2.28479466 2.34826118 2.41172769 2.47519421 2.53866073 2.60212725
[43] 2.66559377 2.72906028 2.79252680 2.85599332 2.91945984 2.98292636 3.04639288
[50] 3.10985939 3.17332591 3.23679243 3.30025895 3.36372547 3.42719199 3.49065850
[57] 3.55412502 3.61759154 3.68105806 3.74452458 3.80799110 3.87145761 3.93492413
[64] 3.99839065 4.06185717 4.12532369 4.18879020 4.25225672 4.31572324 4.37918976
[71] 4.44265628 4.50612280 4.56958931 4.63305583 4.69652235 4.75998887 4.82345539
[78] 4.88692191 4.95038842 5.01385494 5.07732146 5.14078798 5.20425450 5.26772102
[85] 5.33118753 5.39465405 5.45812057 5.52158709 5.58505361 5.64852012 5.71198664
[92] 5.77545316 5.83891968 5.90238620 5.96585272 6.02931923 6.09278575 6.15625227
[99] 6.21971879 6.28318531
How can this be achieved in Haskell?
I tried creating a lambda function and using it with map, but I couldn't get the same output.
Thanks
let myPi = (\x -> 2*pi)
map myPi [1..10]
Well, you can just do
[0, 2*pi/100 .. 2*pi]
Note that this is not ideal both performance-wise and floating-point-rounding-wise (because it translates to enumFromThenTo); Daniel Fischer's version is better (it translates to enumFromTo). Thinking it over, GHC will probably compile both to almost equally fast code, but I'm not sure. If it's really performance-critical, it's best not to use lists at all but e.g. Data.Vector.
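For instance, a sketch using the vector package (Data.Vector.Unboxed; the binding name points is mine):
import qualified Data.Vector.Unboxed as V

-- 100 evenly spaced points from 0 to 2*pi, without building a list
points :: V.Vector Double
points = V.generate 100 (\k -> 2 * pi * fromIntegral k / 99)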
As Jakub Hampl remarked, Haskell can deal with infinite lists. That's probably not much use to you here, but it opens interesting possibilities - for instance, you might not be sure which resolution you actually need. You can let your list begin with a very low resolution, then fold back and start again with a higher one. One simple way to achieve this:
import Data.Fixed
multiResS₁ = [ log x `mod'` 2*pi | x<-[1 .. ] ]
using this to plot the sine function looks like this
Prelude Data.Fixed Graphics.Rendering.Chart.Simple> let domainS₁ = take 200 multiResS₁
Prelude Data.Fixed Graphics.Rendering.Chart.Simple> plotPNG "multiresS1.png" domainS₁ sin
Easiest is a list comprehension,
[(2*pi)*k/99 | k <- [0 .. 99]]
(multiplying by k/99 mitigates the floating-point rounding, so the last value is exactly 2*pi.)
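If you want something that reads like R's seq with length.out, a small helper along these lines works (the name linspace is mine, not a library function; it assumes n >= 2):
-- n evenly spaced points from a to b inclusive; multiplying by k/(n-1)
-- keeps the endpoints exact, as in the comprehension above
linspace :: Double -> Double -> Int -> [Double]
linspace a b n = [ a + (b - a) * fromIntegral k / fromIntegral (n - 1) | k <- [0 .. n - 1] ]

main :: IO ()
main = print (linspace 0 (2 * pi) 100)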