Including comments in AST - haskell

I'm planning on writing a parser for some language. I'm quite confident that I could cobble together a parser in Parsec without too much hassle, but I thought about including comments in the AST so that I could implement a code formatter in the end.
At first, adding an extra parameter to the AST types seemed like a suitable idea (this is basically what was suggested in this answer). For example, instead of having
data Expr = Add Expr Expr | ...
one would have
data Expr a = Add a Expr Expr
and use a for whatever annotation (e.g. for comments that come after the expression).
However, there are some not so exciting cases. The language features C-like comments (// ..., /* .. */) and a simple for loop like this:
for (i in 1:10)
{
... // list of statements
}
Now, excluding the body there are at least 10 places where one could put one (or more) comments:
/*A*/ for /*B*/ ( /*C*/ i /*E*/ in /*F*/ 1 /*G*/ : /*H*/ 10 /*I*/ ) /*J*/
{ /*K*/
...
In other words, while the for loop could previously be comfortably represented as an identifier (i), two expressions (1 & 10) and a list of statements (the body), we would now have to include at least 10 more parameters or record fields for annotations.
This gets ugly and confusing quite quickly, so I wondered whether there is a clearly better way to handle this. I'm certainly not the first person wanting to write a code formatter that preserves comments, so there must be a decent solution, or is writing a formatter just that messy?

You can probably capture most of those positions with just two generic comment productions:
Expr -> Comment Expr
Stmt -> Comment Stmt
This seems like it ought to capture comments A, C, F, H, J, and K for sure; possibly also G depending on exactly what your grammar looks like. That only leaves three spots to handle in the for production (maybe four, with one hidden in Range here):
Stmt -> "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
In other words: one before each literal string but the first. Seems not too onerous, ultimately.
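As a rough sketch of how those productions could map onto AST types (all type and constructor names here are hypothetical, and the comment slots are lists so that zero or several comments fit in one position), the generic productions become wrapper constructors and the for loop keeps only three explicit slots:

```haskell
-- Hypothetical AST sketch; none of these names come from the question.
newtype Comment = Comment String
  deriving (Eq, Show)

data Expr
  = Var String
  | IntLit Integer
  | Range Expr Expr
  | ExprComment Comment Expr -- Expr -> Comment Expr
  deriving (Eq, Show)

data Stmt
  = -- "for" Comment "(" Expr Comment "in" Range Comment ")" Stmt
    For [Comment] Expr [Comment] Expr [Comment] Stmt
  | Block [Stmt]
  | StmtComment Comment Stmt -- Stmt -> Comment Stmt
  deriving (Eq, Show)

-- Collect the comments of an expression in source order.
exprComments :: Expr -> [Comment]
exprComments (Var _) = []
exprComments (IntLit _) = []
exprComments (Range lo hi) = exprComments lo ++ exprComments hi
exprComments (ExprComment c e) = c : exprComments e

-- Collect the comments of a statement in source order.
stmtComments :: Stmt -> [Comment]
stmtComments (For afterFor var beforeIn range beforeClose body) =
  afterFor ++ exprComments var ++ beforeIn ++ exprComments range
    ++ beforeClose ++ stmtComments body
stmtComments (Block stmts) = concatMap stmtComments stmts
stmtComments (StmtComment c s) = c : stmtComments s

-- The loop header from the question, with comment G left out.
exampleLoop :: Stmt
exampleLoop =
  StmtComment (Comment "A") $
    For
      [Comment "B"]
      (ExprComment (Comment "C") (Var "i"))
      [Comment "E"]
      (Range
        (ExprComment (Comment "F") (IntLit 1))
        (ExprComment (Comment "H") (IntLit 10)))
      [Comment "I"]
      (StmtComment (Comment "J") (Block []))
```

Where comment G ends up depends on how the range is parsed, so it is omitted here rather than guessed.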

Related

Permutation parsing with megaparsec + parser-combinators too lenient

I'm attempting to parse permutations of flags. The behavior I want is "one or more flags in any order, without repetition". I'm using the following packages:
megaparsec
parser-combinators
The code I have is outputting what I want, but is too lenient on inputs. I don't understand why it's accepting multiples of the same flags. What am I doing wrong here?
pFlags :: Parser [Flag]
pFlags = runPermutation $ f <$>
toPermutation (optional (GroupFlag <$ char '\'')) <*>
toPermutation (optional (LeftJustifyFlag <$ char '-'))
where f a b = catMaybes [a, b]
Examples:
"'-" = [GroupFlag, LeftJustifyFlag] -- CORRECT
"-'" = [LeftJustifyFlag, GroupFlag] -- CORRECT
"''''-" = [GroupFlag, LeftJustifyFlag] -- INCORRECT, should fail if there's more than one of the same flag.
Instead of toPermutation with optional, I believe you need to use toPermutationWithDefault, something like this (untested):
toPermutationWithDefault Nothing (Just GroupFlag <$ char '\'')
The reasoning is described in the paper “Parsing Permutation Phrases” (PDF) in §4, “adding optional elements” (emph. added):
Consider, for example […] all permutations of a, b and c. Suppose b can be empty and we want to recognise ac. This can be done in three different ways since the empty b can be recognised before a, after a or after c. Fortunately, it is irrelevant for the result of a parse where exactly the empty b is derived, since order is not important. This allows us to use a strategy similar to the one proposed by Cameron: parse nonempty constituents as they are seen and allow the parser to stop if all remaining elements are optional. When the parser stops the default values are returned for all optional elements that have not been recognised.
To implement this strategy we need to be able to determine whether a parser can derive the empty string and split it into its default value and its non-empty part, i.e. a parser that behaves the same except that it does not recognise the empty string.
That is, the permutation parser needs to know which elements can succeed without consuming input, otherwise it will be too eager to commit to a branch. I don’t know why this would lead to accepting multiples of an element, though; perhaps you’re also missing an eof?

Haskell what does the ' symbol do?

As the title states, I see pieces of code online where the variables/functions have ' next to it, what does this do/mean?
ex:
function :: [a] -> [a]
function ...
function' :: ....
The notation comes from mathematics. It is read "x prime". In pretty much any math textbook you can find something like "let x be a number and x' be the projection of ..." (math stuff).
Why not use another convention? Well, in mathematics it makes a lot of sense because it can be very pedagogical... In programming we aren't used to this convention, so I don't see the point of using it, but I am not against it either.
Just to give you an example of its use in mathematics so you can understand why it is used in Haskell. Below is the same triangle concept, once using the prime convention and once not using it. It is pretty clear in the first picture that the pairs (A, A'), (B, B'), ... are related by one being the vertex and the primed version being the midpoint of the opposite edge. Whereas in the second example, you just have to remember that A is the midpoint of the opposite edge of vertex P. The first is easier and more pedagogical:
As the other answers said, function' is just another variable name. So,
don'tUse :: Int -> IO ()
don'tUse won'tBe''used'' = return ()
is just like
dontUse :: Int -> IO ()
dontUse wontBeUsed = return ()
with slightly different names. The only requirement is that the name starts with a lowercase-letter or underscore, after that you can have as many single-quote characters as you want.
Prelude> let _' = 1
Prelude> let _'' = 2
Prelude> let _''''''''' = 9
Prelude> _' + _'' * _'''''''''
19
...Of course it's not necessarily a good idea to name variables like that; normally such prime-names are used when making a slightly different version of an already named thing. For example, foldl and foldl' are functions with the same signature that do essentially the same thing, only with different strictness (which often affects performance, memory usage and whether infinite inputs are allowed, but not the actual results).
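A small sketch of the foldl / foldl' pairing mentioned above. The wrapper names sumLazy and sumStrict are made up for illustration; only foldl and foldl' are real library functions:

```haskell
import Data.List (foldl')

-- Lazy left fold: may build up a chain of unevaluated (+) thunks.
sumLazy :: [Int] -> Int
sumLazy = foldl (+) 0

-- Strict left fold: forces the accumulator at every step.
sumStrict :: [Int] -> Int
sumStrict = foldl' (+) 0
```

Both functions have the same type and return the same result on any finite list; the prime only signals "same idea, slightly different variant".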
That said, to the question
Haskell what does the ' symbol do?
– the ' symbol does in fact do various other things as well, but only when it appears somewhere other than as a non-leading character of a name.
'a' is a character literal.
'Foo is a constructor used on the type level. See DataKinds.
'bar and ''Baz are quoted names. See TemplateHaskell.

"Embedding/inheriting" one `data` constructor in another?

Consider the following fragment:
data File
= NoFile
| FileInfo {
path :: FilePath,
modTime :: Data.Time.Clock.UTCTime
}
| FileFull {
path :: FilePath,
modTime :: Data.Time.Clock.UTCTime,
content :: String
}
deriving Eq
That duplication is a bit of a "wart", though in this one-off instance not particularly painful. In order to further improve my understanding of Haskell's rich type system, what might be preferred "clean"/"idiomatic" approaches for refactoring other than either simply creating a separate data record type for the 2 duplicate fields (then replacing them with single fields of that new data type) or replacing the FileFull record notation with something like | FileFull File String, which wouldn't be quite clean either (as here one would only want FileInfo in there for example, not NoFile)?
(Both these "naive" approaches would be somewhat intrusive/annoying with respect to having to then fix up many modules manually throughout the rest of the code-base here.)
One thing I considered would be parameterizing like so:
data File a
= NoFile
| FileMaybeWithContent {
path :: FilePath,
modTime :: Data.Time.Clock.UTCTime,
content :: a
}
deriving Eq
Then for those "just info, not loaded" contexts a would be (), otherwise String. Seems too general anyway, we want either String or nothing, leading us to Maybe, doing once again away with the a parameter.
Of course we've been there before: content could just be done with Maybe String of course, then "refactor any compile errors away" and "done". That'll probably be the order of the day, but knowing Haskell and the many funky GHC extensions.. who knows just what exotic theoretic trick/axiom/law I've been missing, right?! See, the differently-named "semantic insta-differentiator" between a "just meta-data info" value and a "file content with meta info" value does work well throughout the rest of the code-base as far as eased comprehension.
(And yes, I perhaps should have removed NoFile and used Maybe Files throughout, but then... not sure whether there's really a solid reason to do so and a different question altogether anyway..)
All of the following are equivalent/isomorphic, as I think you've discovered:
data F = U | X A B | Y A B C
data F = U | X AB | Y AB C
data AB = AB A B
data F = U | X A B (Maybe C)
So the color of the bike shed really depends on the context (e.g. do you have use for an AB elsewhere?) and your own aesthetic preferences.
It might clarify things and help you understand what you're doing to have some sense of the algebra of algebraic data types.
We call types like Either "sum types" and types like (,) "product types", and they are subject to the same kinds of transformations you're familiar with, like factoring:
f = 1 + (a * b) + (a * b * c)
= 1 + ((a * b) * ( 1 + c))
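To make the "isomorphic" claim concrete, here is a minimal sketch that converts between the first and the last representation. The names Flat, Factored, toFactored and toFlat are hypothetical, and A, B, C are left as type parameters:

```haskell
-- 1 + (a * b) + (a * b * c)
data Flat a b c = FlatU | FlatX a b | FlatY a b c
  deriving (Eq, Show)

-- 1 + ((a * b) * (1 + c))
data Factored a b c = FacU | FacX a b (Maybe c)
  deriving (Eq, Show)

toFactored :: Flat a b c -> Factored a b c
toFactored FlatU = FacU
toFactored (FlatX a b) = FacX a b Nothing
toFactored (FlatY a b c) = FacX a b (Just c)

toFlat :: Factored a b c -> Flat a b c
toFlat FacU = FlatU
toFlat (FacX a b Nothing) = FlatX a b
toFlat (FacX a b (Just c)) = FlatY a b c
```

Because toFlat . toFactored and toFactored . toFlat are both the identity, neither representation carries more information than the other.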
As others have noted, the NoFile constructor is probably not necessary, but you can keep it if you want. If you feel your code is more readable and/or better understood with it, then I say keep it.
Now the trick with combining the other two constructors is by hiding the content field. You were on the right track by parameterizing File, but that alone isn't enough since then we can have File Foo, File Bar, etc. Fortunately, GHC has some nifty ways to help us.
I'll write out the code here and then explain how it works.
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE DataKinds #-}
import Data.Void (Void)
import Data.Time.Clock (UTCTime)
data Desc = Info | Full
type family Content (a :: Desc) where
Content Full = String
Content _ = Void
data File a = File
{ path :: FilePath
, modTime :: UTCTime
, content :: Content a
}
There are a few things going on here.
First, note that in the File record, the content field now has type Content a instead of just a. Content is a type family, which is (in my opinion) a confusing name for type-level function. That is, the compiler replaces Content a with some other type based on what a is and how we've defined Content.
We defined Content Full to be String, so that when we have a value f1 :: File Full, its content field will have a String value. On the other hand, f2 :: File Info will have a content field with type Void which has no values.
Cool right? But what's preventing us from having File Foo now?
That's where DataKinds comes to the rescue. It "promotes" the data type Desc to a kind (kinds being the types of types in Haskell) and its data constructors, Info and Full, to types of kind Desc instead of merely values of type Desc.
Notice in the declaration of Content that I have annotated a. It looks like a type annotation, but a is already a type. This is a kind annotation. It forces a to be something of kind Desc and the only types of kind Desc are Info and Full.
By now you're probably totally sold on how awesome this is, but I should warn you there's no free lunch. In particular, this is a compile-time construction. Your single File type becomes two different types. This can cause other related logic (producers and consumers of File records) to become complicated. If your use case doesn't mix File Info records with File Full records, then this is the way to go. On the other hand, if you want to do something like have a list of File records which can be a mixture of both types, then you're better off just making the type of your content field Maybe String.
Another thing is, how exactly do you make a File Info since there's no value of Void to use for the content field? Well, technically it should be ok to use undefined or error "this should never happen" since it is (morally) impossible to ever call a function of type Void -> a, but if that makes you feel uneasy (and it probably should), then just replace Void with (). Unit is almost as useless and doesn't require 'values' of bottom.
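A minimal sketch of the () variant suggested above, showing how values of each flavor might be built and consumed. To keep it dependent on base only, ModTime is a stand-in for UTCTime, and infoOnly, load and contentLength are hypothetical helpers, not part of the original answer:

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeFamilies #-}

-- Stand-in for UTCTime (assumption, so that only base is needed).
type ModTime = Integer

data Desc = Info | Full

type family Content (a :: Desc) where
  Content 'Full = String
  Content 'Info = () -- () instead of Void, so no bottom is required

data File (a :: Desc) = File
  { path :: FilePath
  , modTime :: ModTime
  , content :: Content a
  }

-- A metadata-only file: the content field is just ().
infoOnly :: File 'Info
infoOnly = File { path = "notes.txt", modTime = 0, content = () }

-- Attach loaded text, turning a File 'Info into a File 'Full.
load :: String -> File 'Info -> File 'Full
load text f = File { path = path f, modTime = modTime f, content = text }

-- Accepts only fully loaded files; File 'Info is rejected at compile time.
contentLength :: File 'Full -> Int
contentLength = length . content
```

Calling contentLength infoOnly is a type error, which is the compile-time separation described above.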

SML - How to write a Function that converts a dataType to String?

I am currently working on a homework assignment for a lecture in which I have to write an interpreter for a language we defined in the lecture. Part of my homework is to write a function that will take an expression from my datatype and turn it into a string.
Here is my datatype so far:
datatype basicType = voidType | intType | boolType;
datatype etype = basicType
| tupelType of (etype * etype)
| functionType of etype -> basicType;
Note that I am not even sure that this datatype is correct.
I am allowed to use the SML functions String.concatWith and Int.toString.
Since this is homework I should do myself, I would like to get only a starting point and some tips and tricks. No full-blown solutions please.
I am thankful for any input from you girls/guys, since I am about to lose my sanity over this lecture.
You need to implement a function that can print all variants of etype by recursing into each value. You also need a helper function for printing basicType values, but it does not need to be recursive; instead it acts as the base case in your recursion.
Both of them need one function body for each variant of the respective datatype. The body handling the functionType would also need to apply the function.
You should also think about adding some extra string output in each function body to make it possible for the reader to distinguish between the different types.
The tupelType could for example be printed something like this: (a, b) where a and b are recursive calls into each of the tuple elements.

What makes a good name for a helper function?

Consider the following problem: given a list of length three of tuples (String,Int), is there a pair of elements having the same "Int" part? (For example, [("bob",5),("gertrude",3),("al",5)] contains such a pair, but [("bob",5),("gertrude",3),("al",1)] does not.)
This is how I would implement such a function:
import Data.List (sortBy)
import Data.Function (on)
hasPair::[(String,Int)]->Bool
hasPair = napkin . sortBy (compare `on` snd)
where napkin [(_, a),(_, b),(_, c)] | a == b = True
| b == c = True
| otherwise = False
I've used pattern matching to bind names to the "Int" part of the tuples, but I want to sort first (in order to group like members), so I've put the pattern-matching function inside a where clause. But this brings me to my question: what's a good strategy for picking names for functions that live inside where clauses? I want to be able to think of such names quickly. For this example, "hasPair" seems like a good choice, but it's already taken! I find that pattern comes up a lot - the natural-seeming name for a helper function is already taken by the outer function that calls it. So I have, at times, called such helper functions things like "op", "foo", and even "helper" - here I have chosen "napkin" to emphasize its use-it-once, throw-it-away nature.
So, dear Stackoverflow readers, what would you have called "napkin"? And more importantly, how do you approach this issue in general?
General rules for locally-scoped variable naming.
f, k, g, h for super simple local, semi-anonymous things
go for (tail) recursive helpers (precedent)
n, m, i, j for length and size and other numeric values
v for results of map lookups and other dictionary types
s and t for strings.
a:as and x:xs and y:ys for lists.
(a,b,c,_) for tuple fields.
These generally only apply for arguments to HOFs. For your case, I'd go with something like k or eq3.
Use apostrophes sparingly, for derived values.
I tend to call boolean valued functions p for predicate. pred, unfortunately, is already taken.
In cases like this, where the inner function is basically the same as the outer function, but with different preconditions (requiring that the list is sorted), I sometimes use the same name with a prime, e.g. hasPair'.
However, in this case, I would rather try to break down the problem into parts that are useful by themselves at the top level. That usually also makes naming them easier.
import Data.List (sort)
hasPair :: [(String, Int)] -> Bool
hasPair = hasDuplicate . map snd
hasDuplicate :: Ord a => [a] -> Bool
hasDuplicate = not . isStrictlySorted . sort
isStrictlySorted :: Ord a => [a] -> Bool
isStrictlySorted xs = and $ zipWith (<) xs (tail xs)
My strategy follows Don's suggestions fairly closely:
1. If there is an obvious name for it, use that.
2. Use go if it is the "worker" or otherwise very similar in purpose to the original function.
3. Follow personal conventions based on context, e.g. step and start for args to a fold.
4. If all else fails, just go with a generic name, like f.
There are two techniques that I personally avoid. One is using the apostrophe version of the original function, e.g. hasPair' in the where clause of hasPair. It's too easy to accidentally write one when you meant the other; I prefer to use go in such cases. But this isn't a huge deal as long as the functions have different types. The other is using names that might connote something, but not anything that has to do with what the function actually does. napkin would fall into this category. When you revisit this code, this naming choice will probably baffle you, as you will have forgotten the original reason that you named it napkin. (Because napkins have 4 corners? Because they are easily folded? Because they clean up messes? They're found at restaurants?) Other offenders are things like bob and myCoolFunc.
If you have given a function a name that is more descriptive than go or h, then you should be able to look at either the context in which it is used, or the body of the function, and in both situations get a pretty good idea of why that name was chosen. This is where my point #3 comes in: personal conventions. Much of Don's advice applies. If you are using Haskell in a collaborative situation, then coordinate with your team and decide on certain conventions for common situations.
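As an illustration of the go convention, here is one way the original hasPair could be written with a go worker. This is only a sketch and not code from any of the answers; it also works for lists of any length, not just three:

```haskell
import Data.List (sort)

hasPair :: [(String, Int)] -> Bool
hasPair = go . sort . map snd
  where
    -- go assumes its input is sorted, so equal values are adjacent.
    go (x : rest@(y : _)) = x == y || go rest
    go _ = False
```

The name go says nothing beyond "this is the worker for the enclosing function", which is the point: the precondition lives in the comment and the where clause, not in a cute name.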
