Does Unbound always need to be in a `FreshM` monad? - haskell

I'm working on a project based on some existing code that uses the unbound library.
The code uses unsafeUnbind a bunch, which is causing me problems.
I've tried using freshen, but I get the following error:
error "fresh encountered bound name!
Please report this as a bug."
I'm wondering:
Is the library intended to be used entirely within a FreshM monad? Or are there ways to do things like lambda application without being in Fresh?
What kinds of values can I give to freshen, in order to avoid the errors they list?
If I end up using unsafeUnbind, under what conditions is it safe to use?

Is the library intended to be used entirely within a FreshM monad? Or are there ways to do things like lambda application without being in Fresh?
In most situations you will want to operate within a Fresh or an LFresh monad.
What kinds of values can I give to freshen, in order to avoid the errors they list?
I think you're getting the error because you're passing a term to freshen rather than a pattern. In Unbound, patterns are a generalization of names: a single Name E is a pattern consisting of a single variable which stands for Es, but (p1, p2) and [p] are also patterns, made up of a pair of patterns p1 and p2 or a list of patterns p, respectively. This lets you define terms that bind two variables at the same time, for example. Other, more exotic type constructors include Embed t and Rebind p1 p2: the former makes a pattern that embeds a term inside a pattern, while the latter is similar to (p1, p2) except that the names within p1 scope over p2 (for example, if p2 has Embedded terms in it, the names in p1 will scope over those terms). This is really powerful because it lets you define things like Scheme's let* form, or telescopes as in dependently typed languages. (See the paper for details.)
Finally, the type constructor Bind p t is what brings a pattern and a term together: a value of type Bind p t means that the names in p are bound and scope over t. So an (untyped) lambda term might be constructed with data Expr = Lam (Bind Var Expr) | App Expr Expr | V Var where type Var = Name Expr.
So back to freshen. You should only call freshen on patterns so calling it on something of type Bind p t is incorrect (and I suspect the source of the error message you're seeing) - you should call it on just the p and then apply the resulting permutation to the term t to apply the renaming that freshen constructs.
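In practice, most code never calls freshen directly: unbind opens a Bind, freshens the pattern, and applies the renaming to the body in one step. Here is a minimal sketch of lambda application inside a Fresh monad, assuming the unbound-generics package (the original Unbound exposes similar names but derives its instances through RepLib). The function whnf and the values identity and example are illustrative names, not part of the library.

```haskell
{-# LANGUAGE DeriveGeneric, MultiParamTypeClasses #-}
import GHC.Generics (Generic)
import Unbound.Generics.LocallyNameless

type Var = Name Expr

data Expr
  = V Var
  | Lam (Bind Var Expr)
  | App Expr Expr
  deriving (Show, Generic)

-- Alpha-equivalence, free variables and binding operations come from Generic.
instance Alpha Expr

-- Capture-avoiding substitution only needs to know which constructor is a variable.
instance Subst Expr Expr where
  isvar (V x) = Just (SubstName x)
  isvar _     = Nothing

-- Reduce to weak head normal form (illustrative helper).
whnf :: Fresh m => Expr -> m Expr
whnf (App f a) = do
  f' <- whnf f
  case f' of
    Lam b -> do
      -- unbind picks a fresh name for the bound variable and renames the body.
      (x, body) <- unbind b
      whnf (subst x a body)
    _ -> return (App f' a)
whnf e = return e

identity :: Expr
identity = Lam (bind x (V x))
  where x = s2n "x"

-- runFreshM supplies the source of fresh names.
example :: Expr
example = runFreshM (whnf (App identity (V (s2n "y"))))
```

The Fresh constraint stays on whnf itself, so the pure boundary sits only at the runFreshM call.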
If I end up using unsafeUnbind, under what conditions is it safe to use?
The place where I've used it is when I need to temporarily sneak under a binder and do some operation that I know for sure does not do anything to the names. An example might be collecting source position annotations from a term, or replacing some global constant with a closed term. It is also safe if you can guarantee that the term you're working with has already been renamed, so that any names you unsafeUnbind are already unique.
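As an illustration of an operation that never inspects names, here is a hedged sketch that counts the nodes of a term. It assumes unbound-generics, where unsafeUnbind lives in the Unsafe submodule; size is a made-up helper.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import GHC.Generics (Generic)
import Unbound.Generics.LocallyNameless
import Unbound.Generics.LocallyNameless.Unsafe (unsafeUnbind)

type Var = Name Expr

data Expr
  = V Var
  | Lam (Bind Var Expr)
  | App Expr Expr
  deriving (Show, Generic)

instance Alpha Expr

-- Counting nodes ignores what the variables are called, so opening the
-- binder without freshening cannot cause capture here.
size :: Expr -> Int
size (V _)     = 1
size (App f a) = 1 + size f + size a
size (Lam b)   =
  let (_, body) = unsafeUnbind b
  in 1 + size body

selfApp :: Expr
selfApp = Lam (bind x (App (V x) (V x)))
  where x = s2n "x"
```

If the traversal ever compared or substituted the names it finds under the binder, the same shortcut would no longer be safe and unbind in a Fresh monad would be the right tool.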
Hope this helps.
PS: I maintain unbound-generics which is a clone of Unbound, but using GHC.Generics instead of RepLib.

Related

Difficulty of implementing `case` expressions in a template-instantiation evaluator for a lazy functional language?

I'm following "Implementing functional languages: a tutorial" by SPJ, and I'm stuck on Exercise 2.18 (page 70), reproduced below. This is in the chapter about a template-instantiation evaluator for the simple lazy functional language described in the book (similar to a mini Miranda/Haskell):
Exercise 2.18. Why is it hard to introduce case expressions into the template instantiation machine?
(Hint: think about what instantiate would do with a case expression.)
The tutorial then goes on to cover an implementation of several less-general versions of destructuring structured data: an if primitive, a casePair primitive, and a caseList primitive. I haven't yet done the implementation for this section (Chapter 2 Mark 5), but I don't see why implementing these separately would be significantly easier than implementing a single case primitive.
The only plausible explanation I can offer is that the most generic case form is variadic in both the number of alternatives (number of tags to match against) and arity (number of arguments to the structured data). All of the above primitives are fixed-arity and have a known number of alternatives. I don't see why this would make implementation significantly more difficult, however.
The instantiation of the case statement is straightforward:
Instantiate the scrutinee.
Instantiate the body expression of each alternative. (This may be somewhat wasteful if we substitute in unevaluated branches.) (I notice now this may be a problem, will post in an answer.)
Encapsulate the result in a new node type, NCase, where:
data Node = NAp Addr Addr
| ...
| NCase [(Int, [Name], Addr)]
Operationally, the reduction of the case statement is straightforward.
Check if the argument is evaluated.
If not, make it the new stack and push the current stack to the dump. (Similar to evaluating the argument of any primitive.)
If the argument is evaluated, then search for an alternative with a matching tag.
If no alternative with a matching tag is found, then throw an error (non-exhaustive case branches).
Instantiate the body of the matching alternative with the environment augmented with the structured data arguments. (E.g., in case Pack {0, 2} 3 4 of <0> a b -> a + b, instantiate a + b with environment [a <- 3, b <- 4])
A new node type would likely have to be introduced for case (NCase) containing the list of alternatives, but that's not too dissuading.
I found a GitHub repository, bollu/timi, which seems to implement a template-instantiation evaluator also following this tutorial. There is a section called "Lack of lambda and case", which attributes the lack of a generic case statement to the following reason:
Case requires us to have some notion of pattern matching / destructuring which is not present in this machine.
However, in this tutorial there is no notion of pattern-matching; we would simply be matching by tag number (an integer), so I'm not sure if this explanation is valid.
Aside, partly for myself: a very similar question was asked about special treatment for case statements in the next chapter of the tutorial (concerning G-machines rather than template-instantiation).
I think I figured it out while I was expanding on my reasoning in the question. I'll post here for posterity, but if someone has a more understandable explanation or is able to correct me I'll be happy to accept it.
The difficulty lies in the fact that the instantiate step performs all of the variable substitutions, and this happens separately from evaluation (the step function). The problem is as bollu says in the GitHub repository linked in the original question: it is not easy to destructure structured data at instantiation time. This makes it difficult to instantiate the bodies of all of the alternatives.
To illustrate this, consider the instantiation of let expressions. This works like so:
Instantiate each new binding expression.
Augment the current environment with the new bindings.
Instantiate the body with the augmented environment.
However, now consider the case of case expressions. What we want to do is:
Instantiate the scrutinee. (Which should eventually evaluate to the form Pack {m, n} a0 a1 ... an)
For each alternative (each of which has the form <m> b0 b1 ... bn -> body), augment the environment with the new bindings ([b0 <- a0, b1 <- a1, ..., bn <- an]) and then instantiate the body of the alternative.
The problem lies between the two steps: calling instantiate on the scrutinee gives us the instantiated Addr, but we don't readily have access to a0, a1, ..., an to augment the environment with at instantiation time. While this might be possible if the scrutinee were a literal Pack value, if it needed further evaluation (e.g., it was the result of a call to a supercombinator) then we would need to evaluate it first.
To solidify my own understanding, I'd like to answer the additional question: How do the primitives if, casePair, and caseList avoid this problem?
if trivially avoids this problem because boolean values are nullary. casePair and caseList avoid this problem by deferring the variable bindings using thunk(s); the body expressions get instantiated once the thunk is called, which is after the scrutinee is evaluated.
Possible solutions:
I'm thinking that it might be possible to get around this if we define a destructuring primitive operator on structured data objects. I.e., (Pack {m, n} a0 a1 ... an).3 would evaluate to a3.
In this case, what we could do is call instantiate scrut which would give us the address scrutAddr, and we could then augment the environment with new bindings [b0 <- (NAp .0 scrut), b1 <- (NAp .1 scrut), ..., bn <- (NAp .n scrut)].
The issue seems to lie in the fact that instantiation (substitution) and evaluation are separated. If variables were not instantiated separately from evaluation but rather added to/looked up from the environment upon binding/usage, then this would not be a problem. This is as if we placed the bodies of the case statements into thunks to be instantiated after the scrutinee is evaluated, which is similar to what casePair and caseList do.
I haven't worked through either of these alternate solutions or how much extra work they would incur.
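A rough sketch of the second idea, deferring instantiation of the alternatives until the scrutinee has been evaluated. This is not the tutorial's machine: it assumes a drastically simplified heap (an association list), no stack or dump, and no updates. The names Alt, alloc, eval and run are made up for the illustration, and the NCase node stores the uninstantiated alternatives together with the environment they were seen in.

```haskell
import Data.List (mapAccumL)

type Name = String
type Addr = Int
type Env  = [(Name, Addr)]
type Alt  = (Int, [Name], Expr)          -- <tag> vars -> body

data Expr
  = ENum Int
  | EVar Name
  | EConstr Int [Expr]                   -- saturated Pack{tag, n} e1 ... en
  | ECase Expr [Alt]
  deriving (Eq, Show)

data Node
  = NNum Int
  | NData Int [Addr]
  | NCase Addr [Alt] Env                 -- bodies stay uninstantiated here
  deriving (Eq, Show)

type Heap = [(Addr, Node)]

-- The heap only grows, so its length is always an unused address.
alloc :: Heap -> Node -> (Heap, Addr)
alloc heap node = ((addr, node) : heap, addr)
  where addr = length heap

instantiate :: Expr -> Heap -> Env -> (Heap, Addr)
instantiate (ENum n) heap _ = alloc heap (NNum n)
instantiate (EVar v) heap env =
  case lookup v env of
    Just addr -> (heap, addr)
    Nothing   -> error ("unbound variable: " ++ v)
instantiate (EConstr tag args) heap env =
  let (heap', addrs) = mapAccumL (\h e -> instantiate e h env) heap args
  in alloc heap' (NData tag addrs)
instantiate (ECase scrut alts) heap env =
  -- Only the scrutinee is instantiated; the alternatives wait for eval.
  let (heap', scrutAddr) = instantiate scrut heap env
  in alloc heap' (NCase scrutAddr alts env)

-- Evaluate the node at an address and return the address of the result.
eval :: Heap -> Addr -> (Heap, Addr)
eval heap addr =
  case lookup addr heap of
    Just (NCase scrutAddr alts env) ->
      let (heap1, valAddr) = eval heap scrutAddr
      in case lookup valAddr heap1 of
           Just (NData tag args) ->
             case [(vars, body) | (t, vars, body) <- alts, t == tag] of
               (vars, body) : _ ->
                 -- The constructor arguments are known now, so bind them.
                 let (heap2, bodyAddr) = instantiate body heap1 (zip vars args ++ env)
                 in eval heap2 bodyAddr
               [] -> error "non-exhaustive case alternatives"
           _ -> error "scrutinee did not evaluate to structured data"
    Just _  -> (heap, addr)
    Nothing -> error "dangling address"

run :: Expr -> Node
run e =
  let (heap, addr)    = instantiate e [] []
      (heap', result) = eval heap addr
  in maybe (error "dangling address") id (lookup result heap')

-- case Pack{0,2} 3 4 of <0> a b -> b
example :: Node
example = run (ECase (EConstr 0 [ENum 3, ENum 4]) [(0, ["a", "b"], EVar "b")])
```

The price is that instantiate is now called from the evaluator, which is exactly the separation between substitution and evaluation that the plain template-instantiation machine relies on.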

Saving the result of applying a function to a variable in the same variable

I was revisiting the Haskell I learnt last year and suddenly ran into a small problem.
ghci> let test = [1,2,3,4]
ghci> test = drop 1 test
ghci> test
^CInterrupted.
I do not remember if it is possible.
Thanks!
test on the first line and test on the second line are not, in fact, the same variable. They're two different, separate, unrelated variables that just happen to have the same name.
Further, the concept of "saving in a variable" does not apply to Haskell. In Haskell, variables cannot be interpreted as "memory cells", which can hold values. Haskell's variables are more like mathematical variables - just names that you give to complex expressions for easier reasoning (well, this is a bit of an oversimplification, but good enough for now)
Consequently, variables in Haskell are immutable. You cannot change the value of a variable by assignment, like you can in many other languages. This property follows from interpreting the concept of "variable" in the mathematical sense, as described above.
Furthermore, definitions (aka "bindings") in Haskell are recursive. This means that the right side (the body) of a binding may refer to its left side. This is very handy for constructing infinite data structures, for example:
x = 42 : x
An infinite list of 42s
In your example, when you write test = drop 1 test, you're defining a list named test, which is completely unrelated to the list defined on the previous line, and which is equal to itself without the first element. It's only natural that trying to print such a list results in an infinite loop.
The bottom line is: you cannot do what you're trying to do. You cannot create a new binding, which shadows an existing binding, while at the same time references it. Just give it a different name.
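A small sketch of both points, written as top-level definitions rather than a ghci session (the names original, rest and fortyTwos are arbitrary): giving the shortened list its own name works, and a binding that refers to itself is fine as long as only a finite part of it is ever demanded.

```haskell
original :: [Int]
original = [1, 2, 3, 4]

-- A new name for the shortened list; no self-reference, so no loop.
rest :: [Int]
rest = drop 1 original

-- A deliberately recursive binding: an infinite list of 42s.
fortyTwos :: [Int]
fortyTwos = 42 : fortyTwos

-- Laziness means only the demanded prefix is ever built.
firstThree :: [Int]
firstThree = take 3 fortyTwos
```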

In F# what does top-level mean?

When people talk about F# they sometimes mention the term top-level;
what does top-level mean?
For example in previous SO Q&A
Error FS0037 sometimes, very confusing
Defining Modules VS.NET vs F# Interactive
What the difference between a namespace and a module in F#?
AutoOpen attribute in F#
F# and MEF: Exporting Functions
How to execute this F# function
The term also appears regularly in comments but for those Q&A I did not reference them.
The Wikipedia article on scope touches on this, but has no specifics for F#.
The F# 3.x spec only states:
11.2.1.1 Arity Conformance for Functions and Values
The parentheses indicate a top-level function, which might be a
first-class computed expression that computes to a function value,
rather than a compile-time function value.
13.1 Custom Attributes
For example, the STAThread attribute should be placed immediately
before a top-level “do” statement.
14.1.8 Name Resolution for Type Variables
It is initially empty for any member or any other top-level construct
that contains expressions and types.
I suspect the term has different meanings in different contexts:
Scope, F# interactive, shadowing.
If you could also explain the origins from F# predecessor languages, (ML, CAML, OCaml) it would be appreciated.
Lastly I don't plan to mark an answer as accepted for a few days to avoid hasty answers.
I think the term top-level has different meaning in different contexts.
Generally speaking, I'd use it whenever you have some structure that allows nesting to refer to the one position at the top that is not nested inside anything else.
For example, if you said "top-level parentheses" in an expression, it would refer to the outer-most pair of parentheses:
((1 + 2) * (3 * (8)))
^                   ^
When talking about functions and value bindings (and scope) in F#, it refers to the function that is not nested inside another function. So functions inside modules are top-level:
module Foo =
  let topLevel n =
    let nested a = a * 10
    10 + nested n
Here, nested is nested inside topLevel.
In F#, functions and values defined using let can appear inside modules or inside classes, which complicates things a bit - I'd say only those inside modules are top-level, but that's probably just because they are public by default.
The do keyword works similarly: you can nest it (although almost nobody does that), so the top-level do that allows the STAThread attribute is the one that is not nested inside another do or let:
module Foo =
  [<STAThread>]
  do
    printfn "Hello!"
But it is not allowed on any do nested inside another expression:
do
  [<STAThread>]
  do
    printfn "Hello!"
  printfn "This is odd notation, I know..."

Achieving the right abstractions with Haskell's type system

I'm having trouble using Haskell's type system elegantly. I'm sure my problem is a common one, but I don't know how to describe it except in terms specific to my program.
The concepts I'm trying to represent are:
datapoints, each of which takes one of several forms, e.g. (id, number of cases, number of controls), (id, number of cases, population)
sets of datapoints and aggregate information: (set of id's, total cases, total controls), with functions for adding / removing points (so for each variety of point, there's a corresponding variety of set)
I could have a class of point types and define each variety of point as its own type. Alternatively, I could have one point type and a different data constructor for each variety. Similarly for the sets of points.
I have at least one concern with each approach:
With type classes: Avoiding function name collision will be annoying. For example, both types of points could use a function to extract "number of cases", but the type class can't require this function because some other point type might not have cases.
Without type classes: I'd rather not export the data constructors from, say, the Point module (providing other, safer functions to create a new value). Without the data constructors, I won't be able to determine of which variety a given Point value is.
What design might help minimize these (and other) problems?
To expand a bit on sclv's answer, there is an extended family of closely-related concepts that amount to providing some means of deconstructing a value: Catamorphisms, which are generalized folds; Church-encoding, which represents data by its operations, and is often equivalent to partially applying a catamorphism to the value it deconstructs; CPS transforms, where a Church encoding resembles a reified pattern match that takes separate continuations for each case; representing data as a collection of operations that use it, usually known as object-oriented programming; and so on.
In your case, what you seem to want is an abstract type, i.e. one that doesn't export its internal representation, but not a completely sealed one, i.e. one that leaves the representation open to functions in the module that defines it. This is the same pattern followed by things like Data.Map.Map. You probably don't want to go the type class route, since it sounds like you need to work with a variety of data points, rather than with an arbitrary choice of a single type of data point.
Most likely, some combination of "smart constructors" to create values, and a variety of deconstruction functions (as described above) exported from the module is the best starting point. Going from there, I expect most of the remaining details should have an obvious approach to take next.
With the latter solution (no type classes), you can export a catamorphism on the type rather than the constructors:
data MyData = PointData Double Double | ControlData Double Double Double | SomeOtherData String Double
foldMyData pf cf sf d = case d of
    (PointData x y)     -> pf x y
    (ControlData x y z) -> cf x y z
    (SomeOtherData s x) -> sf s x
That way you have a way to pull your data apart into whatever you want (including just ignoring the values and passing functions that return what type of constructor you used) without providing a general way to construct your data.
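To make that concrete, here is a sketch of how client code might use such a fold. The export list is shown only as a comment, and mkPoint (with its non-negativity check), variety and firstNumber are hypothetical helpers, not part of the answer above.

```haskell
-- Intended export list (constructors hidden):
--   module Point (MyData, foldMyData, mkPoint) where

data MyData
  = PointData Double Double
  | ControlData Double Double Double
  | SomeOtherData String Double

foldMyData
  :: (Double -> Double -> r)
  -> (Double -> Double -> Double -> r)
  -> (String -> Double -> r)
  -> MyData
  -> r
foldMyData pf cf sf d = case d of
  PointData x y     -> pf x y
  ControlData x y z -> cf x y z
  SomeOtherData s x -> sf s x

-- Smart constructor: the only way for clients to build a PointData.
-- The validation rule is an assumption made for the example.
mkPoint :: Double -> Double -> Maybe MyData
mkPoint x y
  | x >= 0 && y >= 0 = Just (PointData x y)
  | otherwise        = Nothing

-- Client code: find out which variety a value is, ignoring its fields.
variety :: MyData -> String
variety = foldMyData (\_ _ -> "point") (\_ _ _ -> "control") (\_ _ -> "other")

-- Client code: pull out the first numeric field of any variety.
firstNumber :: MyData -> Double
firstNumber = foldMyData (\x _ -> x) (\x _ _ -> x) (\_ x -> x)
```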
I find the type-classes-based approach better as long as you are not going to mix different data points in a single data structure.
The name collision problem you mentioned can be solved by creating a separate type class for each distinct field, like this:
class WithCases p where
    cases :: p -> NumberOfCases
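A fuller sketch of the one-class-per-field idea, under assumptions: NumberOfCases is taken to be Int, and the two record types and their field names are invented to mirror the two point varieties from the question.

```haskell
type NumberOfCases = Int

class WithCases p where
  cases :: p -> NumberOfCases

-- (id, number of cases, number of controls)
data CaseControl = CaseControl
  { ccId       :: String
  , ccCases    :: Int
  , ccControls :: Int
  }

-- (id, number of cases, population)
data CasePopulation = CasePopulation
  { cpId         :: String
  , cpCases      :: Int
  , cpPopulation :: Int
  }

instance WithCases CaseControl where
  cases = ccCases

instance WithCases CasePopulation where
  cases = cpCases

-- Works for any point type that has a notion of cases, but only for
-- lists holding a single point type at a time.
totalCases :: WithCases p => [p] -> NumberOfCases
totalCases = sum . map cases
```

A point type without cases simply gets no WithCases instance, so the class never forces a meaningless accessor on it.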

Why do a lot of programming languages put the type *after* the variable name?

I just came across this question in the Go FAQ, and it reminded me of something that's been bugging me for a while. Unfortunately, I don't really see what the answer is getting at.
It seems like almost every non C-like language puts the type after the variable name, like so:
var : int
Just out of sheer curiosity, why is this? Are there advantages to choosing one or the other?
There is a parsing issue, as Keith Randall says, but it isn't what he describes. The "not knowing whether it is a declaration or an expression" simply doesn't matter - you don't care whether it's an expression or a declaration until you've parsed the whole thing anyway, at which point the ambiguity is resolved.
Using a context-free parser, it doesn't matter in the slightest whether the type comes before or after the variable name. What matters is that you don't need to look up user-defined type names to understand the type specification - you don't need to have understood everything that came before in order to understand the current token.
Pascal syntax is context-free - if not completely, at least WRT this issue. The fact that the variable name comes first is less important than details such as the colon separator and the syntax of type descriptions.
C syntax is context-sensitive. In order for the parser to determine where a type description ends and which token is the variable name, it needs to have already interpreted everything that came before so that it can determine whether a given identifier token is the variable name or just another token contributing to the type description.
Because C syntax is context-sensitive, it is very difficult (if not impossible) to parse using traditional parser-generator tools such as yacc/bison, whereas Pascal syntax is easy to parse using the same tools. That said, there are parser generators now that can cope with C and even C++ syntax. Although it's not properly documented or in a 1.? release etc, my personal favorite is Kelbt, which uses backtracking LR and supports semantic "undo" - basically undoing additions to the symbol table when speculative parses turn out to be wrong.
In practice, C and C++ parsers are usually hand-written, mixing recursive descent and precedence parsing. I assume the same applies to Java and C#.
Incidentally, similar issues with context sensitivity in C++ parsing have created a lot of nasties. The "Alternative Function Syntax" for C++0x is working around a similar issue by moving a type specification to the end and placing it after a separator - very much like the Pascal colon for function return types. It doesn't get rid of the context sensitivity, but adopting that Pascal-like convention does make it a bit more manageable.
The 'most other' languages you speak of are those that are more declarative. They aim to let you program more along the lines you think in (assuming you aren't boxed into imperative thinking).
Type last reads as 'create a variable called NAME of type TYPE'.
This is, of course, the opposite of saying 'create a TYPE called NAME', but when you think about it, what the value is for is more important than the type; the type is merely a programmatic constraint on the data.
If the name of the variable starts at column 0, it's easier to find the name of the variable.
Compare
QHash<QString, QPair<int, QString> > hash;
and
hash : QHash<QString, QPair<int, QString> >;
Now imagine how much more readable your typical C++ header could be.
In formal language theory and type theory, it's almost always written as var: type. For instance, in the typed lambda calculus you'll see proofs containing statements such as:
x : A    y : B
--------------
 \x.y : A->B
I don't think it really matters, but I think there are two justifications: one is that "x : A" is read "x is of type A", the other is that a type is like a set (e.g. int is the set of integers), and the notation is related to "x ∈ A".
Some of this stuff pre-dates the modern languages you're thinking of.
An increasing trend is to not state the type at all, or to optionally state the type. This could be a dynamically typed language where there really is no type on the variable, or it could be a statically typed language which infers the type from the context.
If the type is sometimes given and sometimes inferred, then it's easier to read if the optional bit comes afterwards.
There are also trends related to whether a language regards itself as coming from the C school or the functional school or whatever, but these are a waste of time. The languages which improve on their predecessors and are worth learning are the ones that are willing to accept input from all different schools based on merit, not be picky about a feature's heritage.
"Those who cannot remember the past are condemned to repeat it."
Putting the type before the variable started innocuously enough with Fortran and Algol, but it got really ugly in C, where some type modifiers are applied before the variable, others after. That's why in C you have such beauties as
int (*p)[10];
or
void (*signal(int x, void (*f)(int)))(int)
together with a utility (cdecl) whose purpose is to decrypt such gibberish.
In Pascal, the type comes after the variable, so the first example becomes
p: pointer to array[10] of int
Contrast with
q: array[10] of pointer to int
which, in C, is
int *q[10]
In C, you need parentheses to distinguish this from int (*p)[10]. Parentheses are not required in Pascal, where only the order matters.
The signal function would be
signal: function(x: int, f: function(int) to void) to (function(int) to void)
Still a mouthful, but at least within the realm of human comprehension.
In fairness, the problem isn't that C put the types before the name, but that it perversely insists on putting bits and pieces before, and others after, the name.
But if you try to put everything before the name, the order is still unintuitive:
int [10] a // an int, ahem, ten of them, called a
int [10]* a // an int, no wait, ten, actually a pointer thereto, called a
So, the answer is: A sensibly designed programming language puts the variables before the types because the result is more readable for humans.
I'm not sure, but I think it's got to do with the "name vs. noun" concept.
Essentially, if you put the type first (such as "int varname"), you're declaring an "integer named 'varname'"; that is, you're giving an instance of a type a name. However, if you put the name first, and then the type (such as "varname : int"), you're saying "this is 'varname'; it's an integer". In the first case, you're giving an instance of something a name; in the second, you're defining a noun and stating that it's an instance of something.
It's a bit like if you were defining a table as a piece of furniture; saying "this is furniture and I call it 'table'" (type first) is different from saying "a table is a kind of furniture" (type last).
It's just how the language was designed. Visual Basic has always been this way.
Most (if not all) curly brace languages put the type first. This is more intuitive to me, as the same position also specifies the return type of a method. So the inputs go into the parentheses, and the output goes out the back of the method name.
I always thought the way C does it was slightly peculiar: instead of constructing types, the user has to declare them implicitly. It's not just before/after the variable name; in general, you may need to embed the variable name among the type attributes (or, in some usage, to embed an empty space where the name would be if you were actually declaring one).
As a weak form of pattern-matching, it is intelligible to some extent, but it doesn't seem to provide any particular advantages, either. And trying to write (or read) a function pointer type can easily take you beyond the point of ready intelligibility. So overall this aspect of C is a disadvantage, and I'm happy to see that Go has left it behind.
Putting the type first helps in parsing. For instance, in C, if you declared variables like
x int;
When you parse just the x, then you don't know whether x is a declaration or an expression. In contrast, with
int x;
When you parse the int, you know you're in a declaration (types always start a declaration of some sort).
Given progress in parsing languages, this slight help isn't terribly useful nowadays.
Fortran puts the type first:
REAL*4 I,J,K
INTEGER*4 A,B,C
And yes, there's a (very feeble) joke there for those familiar with Fortran.
There is room to argue that this is easier than C, which puts the type information around the name when the type is complex enough (pointers to functions, for example).
What about dynamically (cheers @wcoenen) typed languages? You just use the variable.
