Trying to understand Julia syntax in linear regression code (GLM package) - statistics

Total Julia noob here (with basic knowledge of Python). I am trying to do linear regression and things I read suggest the GLM package. Here is some sample code I found here:
using DataFrames, GLM
y = 1:10
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)
sm = GLM.lm( @formula(y ~ x1 + x2), df )
coef(sm)
Can someone explain the syntax here? What does @formula mean? Docs here say @foo means a macro, which I guess is basically just a function, but where do I find the function/macro formula? Just looking at the use here though, I would have thought it is maybe passing y ~ x1 + x2 (whatever that is) as the formula argument to lm? (similar to keyword arguments = in python?)
Next, what is ~ here? General docs say ~ means negation but I'm not seeing how that makes sense here.
Is there a place in the GLM docs where all of this is explained? I'm not seeing that. Only seeing a few examples but not a full breakdown of each function and all of its arguments.

You have stumbled upon the @formula language that is defined in the StatsModels.jl package and implemented in many statistics/econometrics related packages across the Julia ecosystem.
As you say, @formula is a macro, which transforms the expression given to it (here y ~ x1 + x2) into some other Julia expression. If you want to find out what happens when a macro gets called in Julia - which I admit can often look like magic to new (and sometimes experienced!) users - the @macroexpand macro can help you. In this case:
julia> @macroexpand @formula(y ~ x1 + x2)
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2))
The result above is the expression constructed by the @formula macro. We see that the variables in our formula macro are transformed into StatsModels.Term objects. If we were to use StatsModels directly, we could construct this ourselves by doing:
julia> Term(:y) ~ Term(:x1) + Term(:x2)
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
julia> (Term(:y) ~ Term(:x1) + Term(:x2)) == @formula(y ~ x1 + x2)
true
Now what is going on with ~, which as you say can be used for negation in Julia? What has happened here is that StatsModels has defined methods for ~, which in Julia is an infix operator. That essentially means it is a function that can be written in between its arguments rather than having to be called with its arguments in brackets:
julia> (Term(:y) ~ Term(:x)) == ~(Term(:y), Term(:x))
true
So writing y::Term ~ x::Term is the same as calling ~(y::Term, x::Term), and this method for calling ~ with terms on the left and right hand side is defined by StatsModels (see method no. 6 below):
julia> methods(~)
# 6 methods for generic function "~":
[1] ~(x::BigInt) in Base.GMP at gmp.jl:542
[2] ~(::Missing) in Base at missing.jl:100
[3] ~(x::Bool) in Base at bool.jl:39
[4] ~(x::Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}) in Base at int.jl:254
[5] ~(n::Integer) in Base at int.jl:138
[6] ~(lhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}, rhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}) in StatsModels at /home/nils/.julia/packages/StatsModels/pMxlJ/src/terms.jl:397
Note that you also find the general negation meaning here (method 3 above, which defines the behaviour for calling ~ on a boolean argument and is in Base Julia).
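Since the question comes from a Python background, here is a rough Python analogue of the same idea: an infix operator is just a function call, and a package can define what that operator means for its own types. This is only a sketch under assumptions. Python has no binary ~, so >> stands in for it, and Term, Terms and Formula are hypothetical classes made up for illustration, not the StatsModels types.

```python
from dataclasses import dataclass
import operator


@dataclass(frozen=True)
class Term:
    """Hypothetical stand-in for a formula term (not StatsModels.Term)."""
    name: str

    def __add__(self, other):
        # Term + Term collects predictors into a Terms container.
        return Terms((self,)) + other

    def __rshift__(self, rhs):
        # `response >> predictors` plays the role of `response ~ predictors`.
        predictors = rhs if isinstance(rhs, Terms) else Terms((rhs,))
        return Formula(self, predictors)


@dataclass(frozen=True)
class Terms:
    """Hypothetical container for the right-hand side of a formula."""
    items: tuple

    def __add__(self, other):
        extra = other.items if isinstance(other, Terms) else (other,)
        return Terms(self.items + extra)


@dataclass(frozen=True)
class Formula:
    """Hypothetical formula object: one response, several predictors."""
    response: Term
    predictors: Terms


y, x1, x2 = Term("y"), Term("x1"), Term("x2")

# Infix form: + binds tighter than >>, so this is y >> (x1 + x2).
f_infix = y >> x1 + x2

# The same thing written as ordinary function calls.
f_call = operator.rshift(y, operator.add(x1, x2))

print(f_infix == f_call)
print(f_infix)
```

The two spellings build the same object, which mirrors the Julia comparison above between the infix form and ~(Term(:y), Term(:x)).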
I agree that the GLM.jl docs maybe aren't the most comprehensive in the world, but one of the reasons for that is that the whole machinery behind @formula actually isn't a GLM.jl thing - so do check out the StatsModels docs linked above which are quite good I think.

Related

Is there any method/solver in python to solve embedded derivatives in a ODE equation?

I've got this equation from a mathematical model that describes the thermal behavior of a battery.
dTsdt = Ts * a+ Ta * b + dTadt * c + d
However, I can't solve it because of the nested derivatives.
I need to solve the equation for Ts and Ta.
I tried to define it as follows, but Python does not like it and several errors show up.
I'm using scipy.integrate and the odeint solver.
Since the model takes data from vectors, it has to be solved for every time step and the output recorded accordingly.
I also tried assigning the derivatives to variables v1, v2, and then putting everything in an equation without derivatives, like the second approach shown below.
def Tmodel(z,t,a,b,c,d):
    Ts,Ta= z
    dTsdt = Ts*a+ Ta*b + dTadt*c+ d
    dzdt=[dTsdt]
    return dzdt
z0=[0,0]
# solve ODE
for i in range(0,n-1):
   
    tspan = [t[i],t[i+1]]
    # solve for next step
    z = odeint(Tmodel,z0,tspan,arg=(a[i],b[i],c[i],d[i],))
    # store solution for plotting
    Ts[i] = z[1][0]
    Ta[i] = z[1][1]
    # next initial condition
    z0 = z[1]
def Tmodel(z,t,a,b,c,d):
    Ts,v1,Ta,v2= z
# v1= dTsdt
# v2= dTadt
    v1 = Ts*a+ Ta*b + v2*c+ d
    dzdt=[v1,v2]
    return dzdt
That did not work either. I believe there might be a solver capable of solving that equation, or the equation must be decoupled in some way and solved accordingly.
Any advice on how to solve such an equation with Python would be appreciated.
Best regards,
MM
Your difficulty seems to be that you are given Ta in a form with no easy derivative, so you do not know where to take it from. One solution is to avoid this derivative completely and solve the system for y=Ts-c*Ta. Substitute Ts=y+c*Ta in the right side to get
dy/dt = y*a + Ta*(b+c*a) + d
Of course, this requires then a post-processing step Ts=y+c*Ta to get to the requested variable.
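If Ta is given as a function table, use an interpolation function to get values at any odd time t that is demanded by the ODE solver.
import numpy as np
from scipy.integrate import odeint
from scipy.interpolate import interp1d
# interpolate the tabulated Ta; extrapolate in case the solver steps slightly past the last sample
Ta_func = interp1d(Ta_times, Ta_values, fill_value="extrapolate")
def Tmodel(y,t,a,b,c,d):
    Ta = Ta_func(t)
    dydt = y*a + Ta*(b+c*a) + d
    return dydt
# y holds the substituted variable y = Ts - c*Ta at every time step
y = np.zeros(len(t))
y[0] = Ts0 - c[0]*Ta_func(t[0])
for i in range(len(t)-1):
    # odeint takes the extra parameters through the keyword `args`
    y[i+1] = odeint(Tmodel, y[i], t[i:i+2], args=(a[i],b[i],c[i],d[i]))[-1,0]
Ts = y + c*Ta_func(t)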
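To check that the substitution really removes the need for dTa/dt, here is a small self-contained sketch. It makes several assumptions that are not in the question: constant coefficients a, b, c, d with made-up values, a known Ta(t) = sin(t) so that the derivative is available for comparison, and a hand-written RK4 integrator in place of odeint so that it runs with the standard library only.

```python
import math

# Made-up constant coefficients and initial value, for illustration only.
a, b, c, d = -0.5, 0.3, 0.8, 0.1
Ts0 = 1.0


def Ta(t):
    # Assumed ambient temperature; in practice this would be interpolated data.
    return math.sin(t)


def dTa(t):
    # Analytic derivative, needed only by the direct formulation.
    return math.cos(t)


def rk4(f, y0, t0, t1, steps):
    """Integrate dy/dt = f(t, y) from t0 to t1 with classical RK4."""
    h = (t1 - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y


def rhs_direct(t, Ts):
    # Original equation: needs dTa/dt explicitly.
    return a * Ts + b * Ta(t) + c * dTa(t) + d


def rhs_substituted(t, y):
    # After substituting y = Ts - c*Ta: no derivative of Ta appears.
    return a * y + (b + c * a) * Ta(t) + d


t_end = 2.0
ts_direct = rk4(rhs_direct, Ts0, 0.0, t_end, 2000)

y_end = rk4(rhs_substituted, Ts0 - c * Ta(0.0), 0.0, t_end, 2000)
ts_substituted = y_end + c * Ta(t_end)  # post-processing step Ts = y + c*Ta

print(ts_direct, ts_substituted)
```

Both routes should agree to within the integration error, while the second one never evaluates dTa/dt.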

Evaluating logpdf of vector of observations where each observation has different mean parameter

New to Julia and just trying to implement a basic Bayesian model. I would like to evaluate the log-likelihood of each data point, where each data point has a different mean parameter depending on their corresponding covariate, without having to implement a for loop over all data points.
using Distributions
y = -50:1:49
a = 1
b = 1
N = 100
x = rand(Normal(0, 1), N)
mu = a .+ b.*x
sigma = 5
# Can we evaluate the logpdf of every point in one call to logpdf without doing a for loop
loglikelihood = logpdf(Normal(mu, sigma), y)
MethodError: no method matching Normal(::Vector{Float64}, ::Int64)
Edit: I would like to clarify that the mu specified above is a vector of the same dimensions as y, and that instead of evaluating logpdf of each observation using the function Normal(::Real, ::Real) in an iterative procedure, I would like something to the effect of
logpdf(Normal(::Array, ::Real), ::Array). The code I provide in the following chunk does what I want by taking the sum of the log-likelihood across observations, but I would prefer to not have to transform to a multivariate distribution.
using LinearAlgebra
logpdf(MvNormal(mu, diagm(repeat([sigma], outer=N))), y)
Thanks for your help.
Your code doesn't actually run, as there are undefined variables (a, b, y). But in general what you're asking works out of the box:
julia> using Distributions
julia> μ = 2.0; σ = 3.0;
julia> logpdf(Normal(μ, σ), 0:0.5:4)
9-element Vector{Float64}:
-2.2397730440950046
-2.1425508218727822
-2.073106377428338
-2.0314397107616715
-2.0175508218727827
-2.0314397107616715
-2.073106377428338
-2.1425508218727822
-2.2397730440950046
Here I'm getting the log pdf at values 0, 0.5, 1, ..., 3.5, 4. This works because there's a method for logpdf which takes an AbstractArray as second argument:
julia> @which logpdf(Normal(μ, σ), 0:0.5:4)
logpdf(d::UnivariateDistribution{S} where S<:ValueSupport, X::AbstractArray) in Distributions at deprecated.jl:70
julia> @which logpdf(Normal(μ, σ), 0.5)
logpdf(d::Normal, x::Real) in Distributions at ...\Distributions\bawf4\src\univariate\continuous\normal.jl:105
As you see there though, that method signature is actually deprecated. Let's start Julia with depwarn=yes to see the deprecation notice:
$> julia --depwarn=yes
julia> using Distributions
julia> logpdf(Normal(), 1:10)
┌ Warning: `logpdf(d::UnivariateDistribution, X::AbstractArray)` is deprecated, use `logpdf.(d, X)` instead.
│ caller = top-level scope at REPL[4]:1
└ @ Core REPL[4]:1
What this tells you is that actually you don't need a method signature which accepts an array, as Julia's built-in broadcasting syntax - appending a dot to a function call - gives you this for free. Returning to the first example:
julia> logpdf.(Normal(μ, σ), 0:0.5:4)
9-element Vector{Float64}:
-2.2397730440950046
-2.1425508218727822
-2.073106377428338
-2.0314397107616715
-2.0175508218727827
-2.0314397107616715
-2.073106377428338
-2.1425508218727822
-2.2397730440950046
Here, I'm actually calling the logpdf(d::Normal, x::Real) method, but the . after logpdf applies the function elementwise to the range 0:0.5:4.
The broadcast syntax also extends to constructors, so you can use it to construct multiple normal distributions with different mean:
julia> μ = rand(3)
3-element Vector{Float64}:
0.5341692431981215
0.5696647074299088
0.3021675356902611
julia> Normal.(μ, 5)
3-element Vector{Normal{Float64}}:
Normal{Float64}(μ=0.5341692431981215, σ=5.0)
Normal{Float64}(μ=0.5696647074299088, σ=5.0)
Normal{Float64}(μ=0.3021675356902611, σ=5.0)
That's what the error above is telling you: the Normal constructor does not accept a vector as its first argument, only a single value. If you want to apply it to multiple values, just broadcast!
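For readers coming from Python, the same element-wise idea can be sketched without any library. This is only an analogue under assumptions: normal_logpdf is a hand-written helper (not the Distributions.jl method), and the observations and means below are made-up numbers.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of Normal(mu, sigma) at a single point x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


# One distribution, many points: the analogue of logpdf.(Normal(2.0, 3.0), 0:0.5:4)
xs = [0.5 * k for k in range(9)]
same_mean = [normal_logpdf(x, 2.0, 3.0) for x in xs]

# One mean per observation: the analogue of logpdf.(Normal.(mu, sigma), y)
ys = [-1.0, 0.0, 2.5]   # made-up observations
mus = [0.5, 0.6, 0.3]   # made-up per-observation means
per_obs = [normal_logpdf(y, mu, 5.0) for y, mu in zip(ys, mus)]
log_likelihood = sum(per_obs)

print(same_mean)
print(per_obs, log_likelihood)
```

The first list should reproduce the nine values printed in the Julia session above; the second shows the per-observation version, where each data point is paired with its own mean.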

Bachelier Normal Implied Vol Python Calculation (Help) Jekel

Writing a Python script to calculate implied normal vol, in line with Jäckel's article (industry standard).
https://jaeckel.000webhostapp.com/ImpliedNormalVolatility.pdf
They say they are using a Generalized Incomplete Gamma Function Inverse.
For a call:
F(x)=v/(K - F) -> find x that makes this true
Where F is Inverse Incomplete Gamma Function
And x = (K - F)/(sigma*sqrt(T)); v is the value of a call
for that x, IV = (K - F)/(x*sqrt(T))
Example I am working with:
F=40
X=38
T=100/365
v=5.25
Vol= 20%
Using the equations I should be able to backout Vol of 20%
Scipy has upper and lower Incomplete Gamma Function Inverse in their special functions.
Lower: scipy.special.gammaincinv(a, y) : {a must be positive param}
Upper: scipy.special.gammainccinv(a, y) : {a must be positive param}
Implementation:
SIG= sympy.symbols('SIG')
F=40
T=100/365
K=38
def Objective(sig):
    SIG=sig
    return(special.gammaincinv(.5,((F-K)**2)/(2*T*SIG**2))+special.gammainccinv(.5,((F-K)**2)/(2*T*SIG**2))+5.25/(K-F))
x=optimize.brentq(Objective, -20.00,20.00, args=(), xtol=1.48e-8, rtol=1.48e-8, maxiter=1000, full_output=True)
IV=(K-F)/x*T**.5
Print(IV)
I know I am wrong, but Where am I going wrong / how do I fix it and use what I read in the article ?
Did you also post this on the Quantitative Finance Stack Exchange? You may get a better response there.
This is not my field, but it looks like your main problem is that brentq requires the passed Objective function to return values with opposite signs when passed the -20 and 20 arguments. However, this will not end up happening because according to the scipy docs, gammaincinv and gammainccinv always return a value between 0 and infinity.
I'm not sure how to fix this, unfortunately. Did you try implementing the analytic solution (rather than iterative root finding) in the second part of the paper?
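To illustrate the sign-change requirement, here is a minimal sketch of a root find that does bracket a root. It is not the method from the linked article: it assumes the standard undiscounted Bachelier call price and inverts it with plain bisection, using only the standard library. The function names and the bracket [1e-8, 1e3] are choices made for this example.

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def bachelier_call(F, K, T, sigma):
    """Undiscounted Bachelier (normal model) call price."""
    s = sigma * math.sqrt(T)
    d = (F - K) / s
    return (F - K) * norm_cdf(d) + s * norm_pdf(d)


def implied_normal_vol(price, F, K, T, lo=1e-8, hi=1e3, tol=1e-10):
    """Bisection on sigma; the bracket must give opposite signs."""
    def objective(sigma):
        return bachelier_call(F, K, T, sigma) - price

    if objective(lo) * objective(hi) > 0:
        raise ValueError("objective has the same sign at both ends of the bracket")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if objective(lo) * objective(mid) <= 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Numbers from the question.
F, K, T, v = 40.0, 38.0, 100 / 365, 5.25
iv = implied_normal_vol(v, F, K, T)
print(iv)

# Round trip: price a call at a known volatility, then recover it.
print(implied_normal_vol(bachelier_call(F, K, T, 8.0), F, K, T))
```

The price is increasing in sigma and tends to the intrinsic value as sigma goes to zero, so the objective is negative at the lower end and positive at the upper end whenever the quoted price exceeds the intrinsic value. With the question's inputs the result should come out close to 20 in absolute price units, which suggests the quoted 20 is a normal volatility rather than a percentage.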

Element-wise variance of an iterator

What's a numerically-stable way of taking the variance of an iterator elementwise? As an example, I would like to do something like
var((rand(4,2) for i in 1:10))
and get back a (4,2) matrix which is the variance in each coefficient. This throws an error using Julia's Base var. Is there a package that can handle this? Or an easy (and storage-efficient) way to do this using the Base Julia function? Or does one need to be developed on its own?
I went ahead and implemented a Welford algorithm to calculate this:
# Welford algorithm
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
function componentwise_meanvar(A;bessel=true)
    x0 = first(A)
    n = 0
    mean = zero(x0)
    M2 = zero(x0)
    delta = zero(x0)
    delta2 = zero(x0)
    for x in A
        n += 1
        delta .= x .- mean
        mean .+= delta./n
        delta2 .= x .- mean
        M2 .+= delta.*delta2
    end
    if n < 2
        return NaN
    else
        if bessel
            M2 .= M2 ./ (n .- 1)
        else
            M2 .= M2 ./ n
        end
        return mean,M2
    end
end
A few other algorithms are implemented in DiffEqMonteCarlo.jl as well. I'm surprised I couldn't find a library for this, but maybe will refactor this out someday.
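For comparison, the same element-wise Welford update can be sketched in plain Python over an iterator of nested lists. This is an illustrative port under assumptions, not the DiffEqMonteCarlo.jl code: matrices are lists of rows, the iterator is consumed once, and an error is raised (rather than NaN returned) when fewer than two samples are seen.

```python
import random


def componentwise_meanvar(matrices, bessel=True):
    """Element-wise mean and variance of an iterator of equally shaped
    matrices (lists of rows), using Welford's single-pass update."""
    n = 0
    mean = None
    m2 = None
    for x in matrices:
        if mean is None:
            # Allocate accumulators with the shape of the first sample.
            mean = [[0.0] * len(row) for row in x]
            m2 = [[0.0] * len(row) for row in x]
        n += 1
        for i, row in enumerate(x):
            for j, value in enumerate(row):
                delta = value - mean[i][j]
                mean[i][j] += delta / n
                m2[i][j] += delta * (value - mean[i][j])
    if n < 2:
        raise ValueError("need at least two samples for a variance")
    denom = n - 1 if bessel else n
    var = [[v / denom for v in row] for row in m2]
    return mean, var


random.seed(0)
# A generator of ten 4x2 matrices; nothing is stored beyond the accumulators.
samples = ([[random.random() for _ in range(2)] for _ in range(4)]
           for _ in range(10))
mean, var = componentwise_meanvar(samples)
for row in var:
    print(row)
```

Only two accumulator matrices are kept in memory, regardless of how many samples the iterator yields.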
See update below for a numerically stable version
Another method to calculate this:
srand(0) # reset random for comparing across implementations
moment2var(t) = (t[3]-t[2].^2./t[1])./(t[1]-1)
foldfunc(x,y) = (x[1]+1,x[2].+y,x[3].+y.^2)
moment2var(foldl(foldfunc,(0,zeros(1,1),zeros(1,1)),(rand(4,2) for i=1:10)))
Gives:
4×2 Array{Float64,2}:
0.0848123 0.0643537
0.0715945 0.0900416
0.111934 0.084314
0.0819135 0.0632765
Similar to:
srand(0) # reset random for comparing across implementations
# naive component-wise application of `var` function
map(var,zip((rand(4,2) for i=1:10)...))
which is the non-iterator version (or offline version in CS terminology).
This method is based on calculation of variance from mean and sum-of-squares. moment2var and foldfunc are just helper functions, but it fits in one line without them.
Comments:
Speedwise, this should be pretty good as well. Perhaps, StaticArrays and initializing the foldl's v0 with the correct eltype of the iterator would save even more time.
Benchmarking gave 5x speed advantage (and better memory usage) over componentwise_meanvar (from another answer) on a sample input.
Using moment2meanvar(t)=(t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1)) gives both mean and variance like componentwise_meanvar.
As @ChrisRackauckas noted, this method suffers from numerical instability when number of elements to sum is large.
--- UPDATE with variant of method ---
A little abstraction of the question asks for a way to do a foldl (and reduce,foldr) on an iterator returning a matrix, element-wise and retaining shape. To do so, we can define an assisting function mfold which takes a folding-function and makes it fold matrices element-wise. Define it as follows:
mfold(f) = (x,y)->[f(t[1],t[2]) for t in zip(x,y)]
For this specific problem of variance, we can define the component-wise fold functions, and a final function to combine the moments into the variance (and mean if wanted). The code:
ff(x,y) = (x[1]+1,x[2]+y,x[3]+y^2) # fold and collect moments
moment2var(t) = (t[3]-t[2]^2/t[1])/(t[1]-1) # calc variance from moments
moment2meanvar(t) = (t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1))
We can see moment2meanvar works on a single vector as follows:
julia> moment2meanvar(foldl(ff,(0.0,0.0,0.0),[1.0,2.0,3.0]))
(2.0, 1.0)
Now to matrix-ize it using mfold (using .-notation):
moment2var.(foldl(mfold(ff),fill((0,0,0),(4,2)),(rand(4,2) for i=1:10)))
@ChrisRackauckas noted this is not numerically stable, and another method (detailed in Wikipedia) is better. Using mfold this could be implemented as:
# better fold function compensating the sums for stability
ff2(x,y) = begin
delta=y-x[2]
mean=x[2]+delta/(x[1]+1)
return (x[1]+1,mean,x[3]+delta*(y-mean))
end
# combine the collected information for the variance (and mean)
m2var(t) = t[3]/(t[1]-1)
m2meanvar(t) = (t[2],t[3]/(t[1]-1))
Again we have:
m2var.(foldl(mfold(ff2),fill((0,0.0,0.0),(4,2)),(rand(4,2) for i=1:10)))
Giving the same results (perhaps a little more accurately).
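The stability difference between the two fold functions is easiest to see on data with a large mean and a small spread. The following Python sketch is an illustration under assumptions: the four data points are chosen for the demonstration (true sample variance 30), and naive_var and welford_var are scalar re-implementations of the sum-of-squares fold and the compensated fold, not the Julia code above.

```python
def naive_var(xs):
    """Variance from the count, the sum and the sum of squares."""
    n, s1, s2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        s1 += x
        s2 += x * x
    return (s2 - s1 * s1 / n) / (n - 1)


def welford_var(xs):
    """Variance from a running mean and a running sum of squared deviations."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - 1)


# Large offset, small spread; the exact sample variance is 30.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print("naive:  ", naive_var(data))
print("welford:", welford_var(data))
```

In double precision the squares are of order 1e18, where neighbouring floats are more than 100 apart, so the subtraction in the naive formula loses the answer entirely, while the Welford update only ever works with small differences.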
Or an easy (and storage-efficient) way to do this using the Base Julia function?
Out of curiosity, why is the standard solution of using var along the external dimension not good for you?
julia> var(cat(3,(rand(4,2) for i in 1:10)...),3)
4×2×1 Array{Float64,3}:
[:, :, 1] =
0.08847 0.104799
0.0946243 0.0879721
0.105404 0.0617594
0.0762611 0.091195
Obviously, I'm using cat here, which clearly is not very storage efficient, just so I can use the Base Julia function and your original generator syntax as per your question. But you could make this storage efficient as well, if you initialise your random values directly on a preallocated array of size (4,2,10), so that's not really an issue here.
Or did I misunderstand your question?
EDIT - benchmark in response to comments
function standard_var(Y, A)
    for i in 1 : length(A)
        Y[:,:,i], = next(A,i);
    end
    var(Y,3)
end
function testit()
    A = (rand(4,2) for i in 1:10000);
    Y = Array{Float64, 3}(4,2,length(A));
    @time componentwise_meanvar(A); # as defined in Chris's answer above
    @time standard_var(Y, A) # standard variance + using preallocation
    @time var(cat(3, A...), 3); # standard variance without preallocation
    return nothing
end
julia> testit()
0.004258 seconds (10.01 k allocations: 1.374 MiB)
0.006368 seconds (49.51 k allocations: 2.129 MiB)
5.954470 seconds (50.19 M allocations: 2.989 GiB, 71.32% gc time)

Is there a language with constrainable types?

Is there a typed programming language where I can constrain types like the following two examples?
A Probability is a floating point number with minimum value 0.0 and maximum value 1.0.
type Probability subtype of float
where
min_value = 0.0
max_value = 1.0
A Discrete Probability Distribution is a map, where: the keys should all be the same type, the values are all Probabilities, and the sum of the values = 1.0.
type DPD<K> subtype of map<K, Probability>
where
sum(values) = 1.0
As far as I understand, this is not possible with Haskell or Agda.
What you want is called refinement types.
It's possible to define Probability in Agda: Prob.agda
The probability mass function type, with sum condition is defined at line 264.
There are languages with more direct refinement types than in Agda, for example ATS
You can do this in Haskell with Liquid Haskell which extends Haskell with refinement types. The predicates are managed by an SMT solver at compile time which means that the proofs are fully automatic but the logic you can use is limited by what the SMT solver handles. (Happily, modern SMT solvers are reasonably versatile!)
One problem is that I don't think Liquid Haskell currently supports floats. If it doesn't though, it should be possible to rectify because there are theories of floating point numbers for SMT solvers. You could also pretend floating point numbers were actually rational (or even use Rational in Haskell!). With this in mind, your first type could look like this:
{p : Float | p >= 0 && p <= 1}
Your second type would be a bit harder to encode, especially because maps are an abstract type that's hard to reason about. If you used a list of pairs instead of a map, you could write a "measure" like this:
measure total :: [(a, Float)] -> Float
total [] = 0
total ((_, p):ps) = p + total ps
(You might want to wrap [] in a newtype too.)
Now you can use total in a refinement to constrain a list:
{dist: [(a, Float)] | total dist == 1}
The neat trick with Liquid Haskell is that all the reasoning is automated for you at compile time, in return for using a somewhat constrained logic. (Measures like total are also very constrained in how they can be written—it's a small subset of Haskell with rules like "exactly one case per constructor".) This means that refinement types in this style are less powerful but much easier to use than full-on dependent types, making them more practical.
Perl6 has a notion of "type subsets" which can add arbitrary conditions to create a "sub type."
For your question specifically:
subset Probability of Real where 0 .. 1;
and
role DPD[::T] {
has Map[T, Probability] $.map
where [+](.values) == 1; # calls `.values` on Map
}
(Note: in current implementations, the "where" part is checked at run-time, but since "real types" are checked at compile-time (that includes your classes), and since there are pure annotations (is pure) inside the std (which is mostly perl6; those are also on operators like *, etc.), it's only a matter of effort put into it, and it shouldn't be much more.)
More generally:
# (%% is the "divisible by", which we can negate, becoming "!%%")
subset Even of Int where * %% 2; # * creates a closure around its expression
subset Odd of Int where -> $n { $n !%% 2 } # using a real "closure" ("pointy block")
Then you can check if a number matches with the Smart Matching operator ~~:
say 4 ~~ Even; # True
say 4 ~~ Odd; # False
say 5 ~~ Odd; # True
And, thanks to multi subs (or multi whatever, really – multi methods or others), we can dispatch based on that:
multi say-parity(Odd $n) { say "Number $n is odd" }
multi say-parity(Even) { say "This number is even" } # we don't name the argument, we just put its type
#Also, the last semicolon in a block is optional
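For comparison, a run-time-checked version of the two types from the question can be sketched in Python. Python has no refinement types, so this is only an analogue of a "where" clause that is checked when a value is constructed, not a static guarantee. The class names follow the question; the tolerance of 1e-9 on the sum is an assumption made here because floating-point sums are rarely exactly 1.0.

```python
import math


class Probability(float):
    """A float constrained to the closed interval [0.0, 1.0]."""

    def __new__(cls, value):
        p = super().__new__(cls, value)
        if not 0.0 <= p <= 1.0:  # also rejects NaN
            raise ValueError(f"{value!r} is not a probability")
        return p


class DPD(dict):
    """A mapping from keys to Probability whose values sum to 1.0.
    The check runs at construction only; later mutation is not re-validated."""

    def __init__(self, mapping):
        super().__init__({k: Probability(v) for k, v in mapping.items()})
        total = math.fsum(self.values())
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"probabilities sum to {total}, not 1.0")


uniform = DPD({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})
print(uniform)

try:
    DPD({"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.15})
except ValueError as err:
    print("rejected:", err)
```

Unlike the statically verified approaches in the other answers, an invalid value is only caught when the constructor actually runs.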
Nimrod is a new language that supports this concept. They are called Subranges. Here is an example. You can learn more about the language here.
type
  TSubrange = range[0..5]
For the first part, yes, that would be Pascal, which has integer subranges.
The Whiley language supports something very much like what you are saying. For example:
type natural is (int x) where x >= 0
type probability is (real x) where 0.0 <= x && x <= 1.0
These types can also be implemented as pre-/post-conditions like so:
function abs(int x) => (int r)
ensures r >= 0:
    //
    if x >= 0:
        return x
    else:
        return -x
The language is very expressive. These invariants and pre-/post-conditions are verified statically using an SMT solver. This handles examples like the above very well, but currently struggles with more complex examples involving arrays and loop invariants.
For anyone interested, I thought I'd add an example of how you might solve this in Nim as of 2019.
The first part of the question is straightforward, since in the interval since this question was asked, Nim has gained the ability to generate subrange types on floats (as well as ordinal and enum types). The code below defines two new float subrange types, Probability and ProbOne.
The second part of the question is more tricky -- defining a type with constraints on a function of its fields. My proposed solution doesn't directly define such a type but instead uses a macro (makePmf) to tie the creation of a constant Table[T,Probability] object to the ability to create a valid ProbOne object (thus ensuring that the PMF is valid). The makePmf macro is evaluated at compile time, ensuring that you can't create an invalid PMF table.
Note that I'm a relative newcomer to Nim so this may not be the most idiomatic way to write this macro:
import macros, tables
type
  Probability = range[0.0 .. 1.0]
  ProbOne = range[1.0..1.0]
macro makePmf(name: untyped, tbl: untyped): untyped =
  ## Construct a Table[T, Probability] ensuring
  ## Sum(Probabilities) == 1.0
  # helper templates
  template asTable(tc: untyped): untyped =
    tc.toTable
  template asProb(f: float): untyped =
    Probability(f)
  # ensure that the passed value is already
  # a table constructor
  tbl.expectKind nnkTableConstr
  var
    totprob: Probability = 0.0
    fval: float
    newtbl = newTree(nnkTableConstr)
  # create Table[T, Probability]
  for child in tbl:
    child.expectKind nnkExprColonExpr
    child[1].expectKind nnkFloatLit
    fval = floatVal(child[1])
    totprob += Probability(fval)
    newtbl.add(newColonExpr(child[0], getAst(asProb(fval))))
  # this serves as the check that probs sum to 1.0
  discard ProbOne(totprob)
  result = newStmtList(newConstStmt(name, getAst(asTable(newtbl))))
makePmf(uniformpmf, {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25})
# this static block will show that the macro was evaluated at compile time
static:
  echo uniformpmf
# the following invalid PMF won't compile
# makePmf(invalidpmf, {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.15})
Note: A cool benefit of using a macro is that nimsuggest (as integrated into VS Code) will even highlight attempts to create an invalid Pmf table.
Modula 3 has subrange types. (Subranges of ordinals.) So for your Example 1, if you're willing to map probability to an integer range of some precision, you could use this:
TYPE PROBABILITY = [0..100]
Add significant digits as necessary.
Ref: More about subrange ordinals here.

Resources