What's a numerically-stable way of taking the variance of an iterator elementwise? As an example, I would like to do something like
var((rand(4,2) for i in 1:10))
and get back a (4,2) matrix which is the variance in each coefficient. This throws an error using Julia's Base var. Is there a package that can handle this? Or an easy (and storage-efficient) way to do this using the Base Julia function? Or does one need to be developed on its own?
I went ahead and implemented a Welford algorithm to calculate this:
# Welford algorithm
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
function componentwise_meanvar(A;bessel=true)
x0 = first(A)
n = 0
mean = zero(x0)
M2 = zero(x0)
delta = zero(x0)
delta2 = zero(x0)
for x in A
n += 1
delta .= x .- mean
mean .+= delta./n
delta2 .= x .- mean
M2 .+= delta.*delta2
end
if n < 2
return NaN
else
if bessel
M2 .= M2 ./ (n .- 1)
else
M2 .= M2 ./ n
end
return mean,M2
end
end
A few other algorithms are implemented in DiffEqMonteCarlo.jl as well. I'm surprised I couldn't find a library for this, but maybe will refactor this out someday.
See update below for a numerically stable version
Another method to calculate this:
srand(0) # reset random for comparing across implementations
moment2var(t) = (t[3]-t[2].^2./t[1])./(t[1]-1)
foldfunc(x,y) = (x[1]+1,x[2].+y,x[3].+y.^2)
moment2var(foldl(foldfunc,(0,zeros(1,1),zeros(1,1)),(rand(4,2) for i=1:10)))
Gives:
4×2 Array{Float64,2}:
0.0848123 0.0643537
0.0715945 0.0900416
0.111934 0.084314
0.0819135 0.0632765
Similar to:
srand(0) # reset random for comparing across implementations
# naive component-wise application of `var` function
map(var,zip((rand(4,2) for i=1:10)...))
which is the non-iterator version (or offline version in CS terminology).
This method is based on calculation of variance from mean and sum-of-squares. moment2var and foldfunc are just a helper functions, but it fits in one-line without them.
Comments:
Speedwise, this should be pretty good as well. Perhaps, StaticArrays and initializing the foldl's v0 with the correct eltype of the iterator would save even more time.
Benchmarking gave 5x speed advantage (and better memory usage) over componentwise_meanvar (from another answer) on a sample input.
Using moment2meanvar(t)=(t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1)) gives both mean and variance like componentwise_meanvar.
As #ChrisRackauckas noted, this method suffers from numerical instability when number of elements to sum is large.
--- UPDATE with variant of method ---
A little abstraction of the question asks for a way to do a foldl (and reduce,foldr) on an iterator returning a matrix, element-wise and retaining shape. To do so, we can define an assisting function mfold which takes a folding-function and makes it fold matrices element-wise. Define it as follows:
mfold(f) = (x,y)->[f(t[1],t[2]) for t in zip(x,y)]
For this specific problem of variance, we can define the component-wise fold functions, and a final function to combine the moments into the variance (and mean if wanted). The code:
ff(x,y) = (x[1]+1,x[2]+y,x[3]+y^2) # fold and collect moments
moment2var(t) = (t[3]-t[2]^2/t[1])/(t[1]-1) # calc variance from moments
moment2meanvar(t) = (t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1))
We can see moment2meanvar works on a single vector as follows:
julia> moment2meanvar(foldl(ff,(0.0,0.0,0.0),[1.0,2.0,3.0]))
(2.0, 1.0)
Now to matrix-ize it using foldm (using .-notation):
moment2var.(foldl(mfold(ff),fill((0,0,0),(4,2)),(rand(4,2) for i=1:10)))
#ChrisRackauckas noted this is not numerically stable, and another method (detailed in Wikipedia) is better. Using foldm this could be implemented as:
# better fold function compensating the sums for stability
ff2(x,y) = begin
delta=y-x[2]
mean=x[2]+delta/(x[1]+1)
return (x[1]+1,mean,x[3]+delta*(y-mean))
end
# combine the collected information for the variance (and mean)
m2var(t) = t[3]/(t[1]-1)
m2meanvar(t) = (t[2],t[3]/(t[1]-1))
Again we have:
m2var.(foldl(mfold(ff2),fill((0,0.0,0.0),(4,2)),(rand(4,2) for i=1:10)))
Giving the same results (perhaps a little more accurately).
Or an easy (and storage-efficient) way to do this using the Base Julia function?
Out of curiosity, why is the standard solution of using var along the external dimension not good for you?
julia> var(cat(3,(rand(4,2) for i in 1:10)...),3)
4×2×1 Array{Float64,3}:
[:, :, 1] =
0.08847 0.104799
0.0946243 0.0879721
0.105404 0.0617594
0.0762611 0.091195
Obviously, I'm using cat here, which clearly is not very storage efficient, just so I can use the Base Julia function and your original generator syntax as per your question. But you could make this storage efficient as well, if you initialise your random values directly on a preallocated array of size (4,2,10), so that's not really an issue here.
Or did I misunderstand your question?
EDIT - benchmark in response to comments
function standard_var(Y, A)
for i in 1 : length(A)
Y[:,:,i], = next(A,i);
end
var(Y,3)
end
function testit()
A = (rand(4,2) for i in 1:10000);
Y = Array{Float64, 3}(4,2,length(A));
#time componentwise_meanvar(A); # as defined in Chris's answer above
#time standard_var(Y, A) # standard variance + using preallocation
#time var(cat(3, A...), 3); # standard variance without preallocation
return nothing
end
julia> testit()
0.004258 seconds (10.01 k allocations: 1.374 MiB)
0.006368 seconds (49.51 k allocations: 2.129 MiB)
5.954470 seconds (50.19 M allocations: 2.989 GiB, 71.32% gc time)
Related
I want to solve the following optimization problem using Gekko in python 3.7 window version.
Original Problem
Here, x_s are continuous variables, D and Epsilon are deterministic and they are also parameters.
However, since minimization function exists in the objective function, I remove it using binary variables(z1, z2) and then the problem becomes MINLP as follows.
Modified problem
With Gekko,
(1) Can both original problem & modified problem be solved?
(2) How can I code summation in the objective function and also D & epsilon which are parameters in Gekko?
Thanks in advance.
Both problems should be feasible with Gekko but the original appears easier to solve. Here are a few suggestions for the original problem:
Use m.Maximize() for the objective
Use sum() for the inner summation and m.sum() for outer summation for the objective function. I switch to m.sum() when the summation would create an expression that is over 15,000 characters. Using sum() creates one long expression and m.sum() breaks the summation into pieces but takes longer to compile.
Use m.min3() for the min(Dt,xs) terms or slack variables s with x[i]+s[i]=D[i]. It appears that Dt (size 30) is an upper bound, but it has different dimensions that xs (size 100). Slack variables are much more efficient than using binary variables.
D = np.array(100)
x = m.Array(m.Var,100,lb=0,ub=2000000)
The modified problem has 6000 binary variables and 100 continuous variables. There are 2^6000 potential combinations of those variables so it may take a while to solve, even with the efficient branch and bound method of APOPT. Here are a few suggestions for the modified problem:
Use matrix multiplications when possible. Below is an example of matrix operations with Gekko.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
ni = 3; nj = 2; nk = 4
# solve AX=B
A = m.Array(m.Var,(ni,nj),lb=0)
X = m.Array(m.Var,(nj,nk),lb=0)
AX = np.dot(A,X)
B = m.Array(m.Var,(ni,nk),lb=0)
# equality constraints
m.Equations([AX[i,j]==B[i,j] for i in range(ni) \
for j in range(nk)])
m.Equation(5==m.sum([m.sum([A[i][j] for i in range(ni)]) \
for j in range(nj)]))
m.Equation(2==m.sum([m.sum([X[i][j] for i in range(nj)]) \
for j in range(nk)]))
# objective function
m.Minimize(m.sum([m.sum([B[i][j] for i in range(ni)]) \
for j in range(nk)]))
m.solve()
print(A)
print(X)
print(B)
Declare z1 and z2 variables as integer type with integer=True. Here is more information on using the integer type.
Solve locally with m=GEKKO(remote=False). The processing time will be large and the public server resets connections and deletes jobs every day. Switch to local mode to avoid a potential disruption.
New to Julia and just trying to implement a basic Bayesian model. I would like to evaluate the log-likelihood of each data point, where each data point has a different mean parameter depending on their corresponding covariate, without having to implement a for loop over all data points.
using Distributions
y = -50:1:49
a = 1
b = 1
N = 100
x = rand(Normal(0, 1), N)
mu = a .+ b.*x
sigma = 5
# Can we evaluate the logpdf of every point in one call to logpdf without doing a for loop
loglikelihood = logpdf(Normal(mu, sigma), y)
MethodError: no method matching Normal(::Vector{Float64}, ::Int64)
Edit: I would like to clarify that the mu specified above is a vector of the same dimensions as y, and that instead evaluating logpdf of each observation using the function Normal(::Real, ::Real) in an iterative procedure, I would like to something that handles something to the effect of
logpdf(Normal(::Array, ::Real), ::Array). The code I provide in the following chunk does what I want by taking the sum of the log-likelihood across observations, but I would prefer to not have to transform to a multivariate distribution.
using LinearAlgebra
logpdf(MvNormal(mu, diagm(repeat([sigma], outer=N))), y)
Thanks for your help.
Your code doesn't actually run, as there are undefined variables (a, b, y). But in general what you're asking works out of the box:
julia> using Distributions
julia> μ = 2.0; σ = 3.0;
julia> logpdf(Normal(μ, σ), 0:0.5:4)
9-element Vector{Float64}:
-2.2397730440950046
-2.1425508218727822
-2.073106377428338
-2.0314397107616715
-2.0175508218727827
-2.0314397107616715
-2.073106377428338
-2.1425508218727822
-2.2397730440950046
Here I'm getting the log pdf at values 0, 0.5, 1, ..., 3.5, 4. This works because there's a method for logpdf which takes an AbstractArray as second argument:
julia> #which logpdf(Normal(μ, σ), 0:0.5:4)
logpdf(d::UnivariateDistribution{S} where S<:ValueSupport, X::AbstractArray) in Distributions at deprecated.jl:70
julia> #which logpdf(Normal(μ, σ), 0.5)
logpdf(d::Normal, x::Real) in Distributions at ...\Distributions\bawf4\src\univariate\continuous\normal.jl:105
As you see there though, that method signature is actually deprecated. Let's start Julia with depwarn=yes to see the deprecation notice:
$> julia --depwarn=yes
julia> using Distributions
julia> logpdf(Normal(), 1:10)
┌ Warning: `logpdf(d::UnivariateDistribution, X::AbstractArray)` is deprecated, use `logpdf.(d, X)` instead.
│ caller = top-level scope at REPL[4]:1
└ # Core REPL[4]:1
What this tells you is that actually you don't need a method signature which accepts an array, as Julia's built-in broadcasting syntax - appending a dot to a function call - gives you this for free. Returning to the first example:
julia> logpdf.(Normal(μ, σ), 0:0.5:4)
9-element Vector{Float64}:
-2.2397730440950046
-2.1425508218727822
-2.073106377428338
-2.0314397107616715
-2.0175508218727827
-2.0314397107616715
-2.073106377428338
-2.1425508218727822
-2.2397730440950046
Here, I'm actually calling the logpdf(d::Normal, x::Real) method, but the . after logpdf applies the function elementwise to the range 0:0.5:4.
The broadcast syntax also extends to constructors, so you can use it to construct multiple normal distributions with different mean:
julia> μ = rand(3)
3-element Vector{Float64}:
0.5341692431981215
0.5696647074299088
0.3021675356902611
julia> Normal.(μ, 5)
3-element Vector{Normal{Float64}}:
Normal{Float64}(μ=0.5341692431981215, σ=5.0)
Normal{Float64}(μ=0.5696647074299088, σ=5.0)
Normal{Float64}(μ=0.3021675356902611, σ=5.0)
that's what the error above is telling you - the Normal constructor does not accept a vector as first element, but a single value. If you want to apply it to multiple values, just broadcast!
In Python, it is usually suggested to vectorize the code to make computation faster. For example, if you want to compute the inner product of two vectors, say a and b, usually
Code A
c = np.dot(a, b)
is better than
Code B
c = 0
for i in range(len(a)):
c += a[i] * b[i]
But in Julia, it seems sometimes vectorization is not that helpful. I reckoned '* and dot as vectorized versions and an explicit for loop as a non-vectorized version and got the following result.
using Random
using LinearAlgebra
len = 1000000
rng1 = MersenneTwister(1234)
a = rand(rng1, len)
rng2 = MersenneTwister(12345)
b = rand(rng2, len)
function inner_product(a, b)
result = 0
for i in 1: length(a)
result += a[i] * b[i]
end
return result
end
#time a' * b
#time dot(a, b)
#time inner_product(a, b);
0.013442 seconds (56.76 k allocations: 3.167 MiB)
0.003484 seconds (106 allocations: 6.688 KiB)
0.008489 seconds (17.52 k allocations: 976.752 KiB)
(I know using BenchmarkTools.jl is a more standard way to measure the performance.)
From the output, dot runs faster than for than '*, which is a contradiction to what has been presumed.
So my question is,
does Julia need (or sometimes need) vectorization to speed up computation?
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
Your are not making the benchmarks correctly and the implementation of your function is suboptimal.
julia> using BenchmarkTools
julia> #btime $a' * $b
429.400 μs (0 allocations: 0 bytes)
249985.3680190253
julia> #btime dot($a,$b)
426.299 μs (0 allocations: 0 bytes)
249985.3680190253
julia> #btime inner_product($a, $b)
970.500 μs (0 allocations: 0 bytes)
249985.36801903677
The correct implementation:
function inner_product_correct(a, b)
result = 0.0 #use the same type as elements in the args
#simd for i in 1: length(a)
#inbounds result += a[i] * b[i]
end
return result
end
julia> #btime inner_product_correct($a, $b)
530.499 μs (0 allocations: 0 bytes)
249985.36801902478
There is still the difference (however less significant) because dot is using the optimized BLAS implementation which is multi-threaded. You could (following Bogumil's comment set OPENBLAS_NUM_THREADS=1 and then you will find that the times of BLAS function will be identical as the Julia implementation.
Note also that working with float numbers is tricky in many ways:
julia> inner_product_correct(a, b)==dot(a,b)
false
julia> inner_product_correct(a, b) ≈ dot(a,b)
true
Finally, in Julia deciding whether to use vectorization or write the loop yourself is up two you - there is no performance penalty (as long as you write type stable code and use #simd and #inbounds where required). However in your codes you were not testing vectorization but you were comparing calling BLAS to writing the loop yourself. Here is the must-read to understand what is going on https://docs.julialang.org/en/v1/manual/performance-tips/
Let me add my practical experience as a comment (too long for a standard comment):
does Julia need (or sometimes need) vectorization to speed up computation?
Julia does not need vectorization as Python (see the answer by Przemysław), but in practice if you have a well written vectorized function (like dot) then use it as, while possible, it might be sometimes tricky to write as performant function yourself (people have probably spent days on optimizing dot, especially to optimally use multiple threads).
If it does, then when to use vectorization and which is the better way to use (consider dot and '*)?
When you use vectorized code then it all depends on implementation of the function you want to use. In this case dot(a, b) and a' * b are exactly the same as when you write #edit a' * b gives you in this case:
*(u::AdjointAbsVec{<:Number}, v::AbstractVector{<:Number}) = dot(u.parent, v)
and you see it is the same.
If it does not, then what is the difference between Julia and Python in terms of the mechanism of vectorized and non-vectorized codes?
Julia is a compiled language, while Python is an interpreted language. In some cases Python interpreter can provide fast execution speed, but in other cases it currently is not able to do it (but it does not mean that in the future it will not improve). In particular vectorized functions (like dot in your question) are most likely written in some compiled language, so Julia and Python will not differ much in typical cases as they just call this compiled function. However, when you use loops (non-vectorized code) then currently Python will be slower than Julia.
I am currently attempting to implement a metaheuristic (genetic) algorithm. In this venture i also want to try and create somewhat fast and efficient code. However, my experience in creating efficient coding is not very great. I was therefore wondering if some people could give some "quick tips" to increase the efficiency of my code. I have created a small functional example of my code which contains most of the elements that the code will contain i regards to preallocating arrays, custom mutable structs, random numbers, pushing into arrays etc.
The options that I have already attempted to explore are options in regards to the package "StaticArrays". However many of my arrays must be mutable (so we need MArrays) and many of them will become very large > 100. The documentation of StaticArrays specify that the size of the StaticArrays package must remain small to remain efficient.
According to the documentation Julia 1.5.2 should be thread safe in regards to rand(). I have therefor attempted to multithread for-loops in my functions to make them run faster. And this results in a slight performance increase .
However if people can se a more efficient way of allocating Arrays or pushing in SpotPrices into an array it would be greatly appreciated! Any other performance tips are also very welcome!
# Packages
clearconsole()
using DataFrames
using Random
using BenchmarkTools
Random.seed!(42)
df = DataFrame( SpotPrice = convert(Array{Float64}, rand(-266:500,8832)),
month = repeat([1,2,3,4,5,6,7,8,9,10,11,12]; outer = 736),
hour = repeat([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]; outer = 368))
# Data structure for the prices per hour
mutable struct SpotPrices
hour :: Array{Float64,1}
end
# Fill-out data structure
function setup_prices(df::DataFrame)
prices = []
for i in 1:length(unique(df[:,3]))
push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
end
return prices
end
prices = setup_prices(df)
# Sampler function
function MC_Sampler(prices::Vector{Any}, sample_size::Int64)
# Picking the samples
tmp = zeros(sample_size, 24)
# Sampling per hour
for i in 1:24
tmp[:,i] = rand(prices[i].hour, sample_size)
end
return tmp
end
samples = MC_Sampler(prices, 100)
#btime setup_prices(df)
#btime MC_Sampler(prices,100)
function setup_prices_par(df::DataFrame)
prices = []
#sync Threads.#threads for i in 1:length(unique(df[:,3]))
push!(prices, SpotPrices(filter(row -> row.hour == i, df).SpotPrice))
end
return prices
end
# Sampler function
function MC_Sampler_par(prices::Vector{Any}, sample_size::Int64)
# Picking the samples
tmp = zeros(sample_size, 24)
# Sampling per hour
#sync Threads.#threads for i in 1:24
tmp[:,i] = rand(prices[i].hour, sample_size)
end
return tmp
end
#btime setup_prices_par(df)
#btime MC_Sampler_par(prices,100)
Have a look at read very carefully https://docs.julialang.org/en/v1/manual/performance-tips/
Basic cleanups start with:
Your SpotPrices struct does not need to me mutable. Anyway since there is only one field you could just define it as SpotPrices=Vector{Float64}
You do not want untyped containers - instead of prices = [] do prices = Float64[]
Using DataFrames.groupby will be much faster than finding unique elements and filtering by them
If yo do not need initialze than do not do it Vector{Float64}(undef, sample_size) is much faster than zeros(sample_size, 24)
You do not need to synchronize #sync before a multi-threaded loop
Create a random states - one separate one for each thread and use them whenever calling the rand function
I am trying to evaluate points in a large piecewise polynomial, which is obtained from a cubic-spline. This takes a long time to do and I would like to speed it up.
As such, I would like to evaluate a points on a piecewise polynomial with parallel processes, rather than sequentially.
Code:
z = zeros(1e6, 1) ; % preallocate some memory for speed
Y = rand(11220,161) ; %some data, rand for generating a working example
X = 0 : 0.0125 : 2 ; % vector of data sites
pp = spline(X, Y) ; % get the piecewise polynomial form of the cubic spline.
The resulting structure is large.
for t = 1 : 1e6 % big number
hcurrent = ppval(pp,t); %evaluate the piecewise polynomial at t
z(t) = sum(x(t:t+M-1).*hcurrent,1) ; % do some operation of the interpolated value. Most likely not relevant to this question.
end
Unfortunately, with matrix form and using:
hcurrent = flipud(ppval(pp, 1: 1e6 ))
requires too much memory to process, so cannot be done. Is there a way that I can batch process this code to speed it up?
For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code code that take advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, move some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(#power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
hcurrent = sum(bsxfun(#times,c(idx(t)+d,:),x(t,:)),2);
z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optmized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.
Use parfor command for parallel loops. see here, also precompute z vector as z(j) = x(j:j+M-1) and hcurrent in parfor for speed up.
The Spline Parameters estimation can be written in Matrix form.
Once you write it in Matrix form and solve it you can use the Model Matrix to evaluate the Spline on all data point using Matrix Multiplication which is probably the most tuned operation in MATLAB.