Total Julia noob here (with basic knowledge of Python). I am trying to do linear regression and things I read suggest the GLM package. Here is some sample code I found here:
using DataFrames, GLM
y = 1:10
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)
sm = GLM.lm( #formula(y ~ x1 + x2), df )
coef(sm)
Can someone explain the syntax here? What does #formula mean? Docs here say #foo means a
macro which I guess is basically just a function, but where do I find the function/macro formula? Just looking at the use here though, I would have thought it is maybe passing y ~ x1 + x2 (whatever that is) as the formula argument to lm? (similar to keyword arguments = in python?)
Next, what is ~ here? General docs say ~ means negation but I'm not seeing how that makes here.
Is there a place in the GLM docs where all of this is explained? I'm not seeing that. Only seeing a few examples but not a full breakdown of each function and all of its arguments.
You have stumbled upon the #formula language that is defined in the StatsModels.jl package and implemented in many statistics/econometrics related packages across the Julia ecosystem.
As you say, #formula is a macro, which transforms the expression given to it (here y ~ x1 + x2) into some other Julia expression. If you want to find out what happens when a macro gets called in Julia - which I admit can often look like magic to new (and sometimes experienced!) users - the #macroexpand macro can help you. In this case:
julia> #macroexpand #formula(y ~ x1 + x2)
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2))
The result above is the expression constructed by the #formula macro. We see that the variables in our formula macro are transformed into StatsModels.Term objects. If we were to use StatsModels directly, we could construct this ourselves by doing:
julia> Term(:y) ~ Term(:x1) + Term(:x2)
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
julia> (Term(:y) ~ Term(:x1) + Term(:x2)) == #formula(y ~ x1 + x2)
true
Now what is going on with ~, which as you say can be used for negation in Julia? What has happened here is that StatsModels has defined methods for ~ (which in Julia is and infix operator, that means essentially it is a function that can be written in between its arguments rather than having to be called with its arguments in brackets:
julia> (Term(:y) ~ Term(:x)) == ~(Term(:y), Term(:x))
true
So writing y::Term ~ x::Term is the same as calling ~(y::Term, x::Term), and this method for calling ~ with terms on the left and right hand side is defined by StatsModels (see method no. 6 below):
julia> methods(~)
# 6 methods for generic function "~":
[1] ~(x::BigInt) in Base.GMP at gmp.jl:542
[2] ~(::Missing) in Base at missing.jl:100
[3] ~(x::Bool) in Base at bool.jl:39
[4] ~(x::Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}) in Base at int.jl:254
[5] ~(n::Integer) in Base at int.jl:138
[6] ~(lhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}, rhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}) in StatsModels at /home/nils/.julia/packages/StatsModels/pMxlJ/src/terms.jl:397
Note that you also find the general negation meaning here (method 3 above, which defines the behaviour for calling ~ on a boolean argument and is in Base Julia).
I agree that the GLM.jl docs maybe aren't the most comprehensive in the world, but one of the reasons for that is that the whole machinery behind #formula actually isn't a GLM.jl thing - so do check out the StatsModels docs linked above which are quite good I think.
I'm struggling to understand the multiplicity of a relation.
In general how should one interpret this
is this every entity of type P has between a and b entities of type C or between x and y or something else. All explanations I've found so for only adres the cases a,x = 0,1 and b,y = *
It's vice versa. P has x..y entities of type C in access and C has a..b of P.
As a side note: the multiplicity labels should not be placed to hide parts of the association.
Every Association contains two independent statements:
Every instance of P is linked to x..y instances of C
Every instance of C is linked to a..b instances of P
Being linked could mean, that P or C have an attribute of type C or P. This is the most common incarnation of a link, but UML does not prescribe this.
The following model produces instances with exactly 2 address relations when the number of Books is limited to 1, however, if more Books are allowed it will create instances with 0-3 address relations. My misunderstanding of how Alloy works?
sig Name{}
sig Addr{}
sig Book { addr: Name -> lone Addr }
pred show(b:Book) { #b.addr = 2 }
// nr. of address relations in every Book should be 2
run show for 3 but 2 Book
// works as expected with 1 Book
Each instance of show should include one Book, labeled as being the b of show, which has two address pairs. But show does not say that every book must have two address pairs, only that at least one must have two address pairs.
[Postscript]
When you ask Alloy to show you an instance of a predicate, for example by the command run show, then Alloy should show you an instance: that is (quoting section 5.2.1 of Software abstractions, which you already have open) "an assignment of values to the variables of the constraint for which the constraint evaluates to true." In any given universe, there may be many other possible assignments of values to the variables for which the constraint evaluates to false; the existence of such possible non-suitable assignments is unavoidable in any universe with more than one atom.
Informally, we can think of a run command for a predicate P with arguments X, Y, Z as requesting that Alloy show us universes which satisfy the expression
some X, Y, Z : univ | P[X, Y, Z]
The run command does not amount to the expression
all X, Y, Z : univ | P[X, Y, Z]
If you want to see only universes in which every book has two pairs in its addr relation, then say so:
pred all_books_have_2 { all b : Book | #b.addr = 2 }
I think it's better that run have implicit existential quantification, rather than implicit universal quantification. One way to see why is to imagine a model that defines trees, such as:
sig Node { parent : lone Node }
fact parent_acyclic { no n : Node | n in n.^parent }
Suppose we get tired of seeing universes in which every tree is trivial and contains a single node. I'd like to be able to define a predicate that guarantees at least one tree with depth greater than 1, by writing
pred nontrivial[n : Node]{ some n.parent }
Given the constraint that trees be acyclic, there can never be a non-empty universe in which the predicate nontrivial holds for all nodes. So if run and pred had the semantics you have been supposing, we could not use nontrivial to find universes containing non-trivial trees.
I hope this helps.
I have a little alloy specification as follows:
sig class {parents : set class}
fact f1{all p:class | not p in p.^parents}
run{} for exactly 4 class
First, I thought alloy would translate f1 into boolean matrix, then perform closure operation on it. But it seems it does not do this kind of translation (it looks like it runs something before boolean matrix creation.). So how exactly this f1 gets translated? Does it use relation predicate? I am just very curious about alloy's translation.
Boolean matrices are used to represent expression in Alloy. So, you start with a unary matrix for each sig, a binary matrix for each binary relation, a ternary matrix for each ternary relation, and so on. Then, translation of "complex" expression (e.g., involving relational algebra operators) is done by manipulating (composing) those matrices you started with. For each Alloy operator (e.g., transitive closure (^), relational join (.), in, not, etc.) there is a corresponding algorithm that performs a bunch of matrix operations such that the semantics of that operator is correctly implemented.
So in this example, the all quantifier is first unrolled, meaning that for each atom p of type class the body is translated (something like:
m0 = matrix(p) //returns matrix corresponding to p
m1 = matrix(parents) //returns matrix corresponding to parents
m2 = ^m1
m3 = m0.m2
m4 = m0 in m3
m5 = not m4
), and finally, all those body translations are AND'ed.
My program (Hartree-Fock/iterative SCF) has two matrices F and F' which are really the same matrix expressed in two different bases. I just lost three hours of debugging time because I accidentally used F' instead of F. In C++, the type-checker doesn't catch this kind of error because both variables are Eigen::Matrix<double, 2, 2> objects.
I was wondering, for the Haskell/ML/etc. people, whether if you were writing this program you would have constructed a type system where F and F' had different types? What would that look like? I'm basically trying to get an idea how I can outsource some logic errors onto the type checker.
Edit: The basis of a matrix is like the unit. You can say 1L or however many gallons, they both mean the same thing. Or, to give a vector example, you can say (0,1) in Cartesian coordinates or (1,pi/2) in polar. But even though the meaning is the same, the numerical values are different.
Edit: Maybe units was the wrong analogy. I'm not looking for some kind of record type where I can specify that the first field will be litres and the second gallons, but rather a way to say that this matrix as a whole, is defined in terms of some other matrix (the basis), where the basis could be any matrix of the same dimensions. E.g., the constructor would look something like mkMatrix [[1, 2], [3, 4]] [[5, 6], [7, 8]] and then adding that object to another matrix would type-check only if both objects had the same matrix as their second parameters. Does that make sense?
Edit: definition on Wikipedia, worked examples
This is entirely possible in Haskell.
Statically checked dimensions
Haskell has arrays with statically checked dimensions, where the dimensions can be manipulated and checked statically, preventing indexing into the wrong dimension. Some examples:
This will only work on 2-D arrays:
multiplyMM :: Array DIM2 Double -> Array DIM2 Double -> Array DIM2 Double
An example from repa should give you a sense. Here, taking a diagonal requires a 2D array, returns a 1D array of the same type.
diagonal :: Array DIM2 e -> Array DIM1 e
or, from Matt sottile's repa tutorial, statically checked dimensions on a 3D matrix transform:
f :: Array DIM3 Double -> Array DIM2 Double
f u =
let slabX = (Z:.All:.All:.(0::Int))
slabY = (Z:.All:.All:.(1::Int))
u' = (slice u slabX) * (slice u slabX) +
(slice u slabY) * (slice u slabY)
in
R.map sqrt u'
Statically checked units
Another example from outside of matrix programming: statically checked units of dimension, making it a type error to confuse e.g. feet and meters, without doing the conversion.
Prelude> 3 *~ foot + 1 *~ metre
1.9144 m
or for a whole suite of SI units and quanities.
E.g. can't add things of different dimension, such as volumes and lengths:
> 1 *~ centi litre + 2 *~ inch
Error:
Expected type: Unit DVolume a1
Actual type: Unit DLength a0
So, following the repa-style array dimension types, I'd suggest adding a Base phantom type parameter to your array type, and using that to distinguish between bases. In Haskell, the index Dim
type argument gives the rank of the array (i.e. its shape), and you could do similarly.
Or, if by base you mean some dimension on the units, using dimensional types.
So, yep, this is almost a commodity technique in Haskell now, and there's some examples of designing with types like this to help you get started.
This is a very good question. I don't think you can encode the notion of a basis in most type systems, because essentially anything that the type checker does needs to be able to terminate, and making judgments about whether two real-valued vectors are equal is too difficult. You could have (2 v_1) + (2 v_2) or 2 (v_1 + v_2), for example. There are some languages which use dependent types [ wikipedia ], but these are relatively academic.
I think most of your debugging pain would be alleviated if you simply encoded the bases in which you matrix works along with the matrix. For example,
newtype Matrix = Matrix { transform :: [[Double]],
srcbasis :: [Double], dstbasis :: [Double] }
and then, when you M from basis a to b with N, check that N is from b to c, and return a matrix with basis a to c.
NOTE -- it seems most people here have programming instead of math background, so I'll provide short explanation here. Matrices are encodings of linear transformations between vector spaces. For example, if you're encoding a rotation by 45 degrees in R^2 (2-dimensional reals), then the standard way of encoding this in a matrix is saying that the standard basis vector e_1, written "[1, 0]", is sent to a combination of e_1 and e_2, namely [1/sqrt(2), 1/sqrt(2)]. The point is that you can encode the same rotation by saying where different vectors go, for example, you could say where you're sending [1,1] and [1,-1] instead of e_1=[1,0] and e_2=[0,1], and this would have a different matrix representation.
Edit 1
If you have a finite set of bases you are working with, you can do it...
{-# LANGUAGE EmptyDataDecls #-}
data BasisA
data BasisB
data BasisC
newtype Matrix a b = Matrix { coefficients :: [[Double]] }
multiply :: Matrix a b -> Matrix b c -> Matrix a c
multiply (Matrix a_coeff) (Matrix b_coeff) = (Matrix multiplied) :: Matrix a c
where multiplied = undefined -- your algorithm here
Then, in ghci (the interactive Haskell interpreter),
*Matrix> let m = Matrix [[1, 2], [3, 4]] :: Matrix BasisA BasisB
*Matrix> m `multiply` m
<interactive>:1:13:
Couldn't match expected type `BasisB'
against inferred type `BasisA'
*Matrix> let m2 = Matrix [[1, 2], [3, 4]] :: Matrix BasisB BasisC
*Matrix> m `multiply` m2
-- works after you finish defining show and the multiplication algorithm
While I realize this does not strictly address the (clarified) question – my apologies – it seems relevant at least in relation to Don Stewart's popular answer...
I am the author of the Haskell dimensional library that Don referenced and provided examples from. I have also been writing – somewhat under the radar – an experimental rudimentary linear algebra library based on dimensional. This linear algebra library statically tracks the sizes of vectors and matrices as well as the physical dimensions ("units") of their elements on a per element basis.
This last point – tracking physical dimensions on a per element basis – is rather challenging and perhaps overkill for most uses, and one could even argue that it makes little mathematical sense to have quantities of different physical dimensions as elements in any given vector/matrix. However, some linear algebra applications of interest to me such as kalman filtering and weighted least squares estimation typically use heterogeneous state vectors and covariance matrices.
Using a Kalman filter as an example, consider a state vector x = [d, v] which has physical dimensions [L, LT^-1]. The next (future) state vector is predicted by multiplication by the state transition matrix F, i.e.: x' = F x_. Clearly for this equation to make sense F cannot be arbitrary but must have size and physical dimensions [[1, T], [T^-1, 1]]. The predict_x' function below statically ensures that this relationship holds:
predict_x' :: (Num a, MatrixVector f x x) => Mat f a -> Vec x a -> Vec x a
predict_x' f x_ = f |*< x_
(The unsightly operator |*< denotes multiplication of a matrix on the left with a vector on the right.)
More generally, for an a priori state vector x_ of arbitrary size and with elements of arbitrary physical dimensions, passing a state transition matrix f with "incompatible" size and/or physical dimensions to predict_x' will cause a compile time error.
In F# (which originally evolved from OCaml), you can use units of measure. Andrew Kenned, who designed the feature (and also created a very interesting theory behind it) has a great series of articles that demonstrate it.
This can quite likely be used in your scenario - although I don't fully understand the question. For example, you can declare two unit types like this:
[<Measure>] type litre
[<Measure>] type gallon
Adding litres and gallons gives you a compile time error:
1.0<litre> + 1.0<gallon> // Error!
F# doesn't automatically insert conversion between different units, but you can write a conversion function:
let toLitres gal = gal * 3.78541178<litre/gallon>
1.0<litre> + (toLitres 1.0<gallon>)
The beautiful thing about units of measure in F# is that they are automatically inferred and functions are generic. If you multiply 1.0<gallon> * 1.0<gallon>, the result is 1.0<gallon^2>.
People have used this feature for various things - ranging from conversion of virtual meters to screen pixels (in solar system simulations) to converting currencies (dollars in financial systems). Although I'm not expert, it is quite likely that you could use it in some way for your problem domain too.
If it's expressed in a different base, you can just add a template parameter to act as the base. That will differentiate those types. A float is a float is a float- if you don't want two float values to be the same if they actually have the same value, then you need to tell the type system about it.