Parsec vs Yacc/Bison/Antlr: Why and when to use Parsec?


I'm new to Haskell and Parsec. After reading Chapter 16 Using Parsec of Real World Haskell, a question appeared in my mind: Why and when is Parsec better than other parser generators like Yacc/Bison/Antlr?
My understanding is that Parsec creates a nice DSL of writing parsers and Haskell makes it very easy and expressive. But parsing is such a standard/popular technology that deserves its own language, which outputs to multiple target languages. So when shall we use Parsec instead of, say, generating Haskell code from Bison/Antlr?
This question might go a little beyond technology, and into the realm of industry practice. When writing a parser from scratch, what's the benefit of picking up Haskell/Parsec compared to Bison/Antlr or something similar?
BTW: my question is quite similar to this one but wasn't answered satisfactorily there.

One of the main differences between the tools you listed is that ANTLR, Bison and their friends are parser generators, whereas Parsec is a parser combinator library.
A parser generator reads in a description of a grammar and spits out a parser. It is generally not possible to combine existing grammars into a new grammar, and it is certainly not possible to combine two existing generated parsers into a new parser.
A parser combinator, OTOH, does nothing but combine existing parsers into new parsers. Usually, a parser combinator library ships with a couple of trivial built-in parsers that can parse the empty string or a single character, and it ships with a set of combinators that take one or more parsers and return a new one that, for example, parses the sequence of the original parsers (e.g. you can combine a d parser and an o parser to form a do parser), the alternation of the original parsers (e.g. a 0 parser and a 1 parser to a 0|1 parser), or applies the original parser multiple times (repetition).
What this means is that you could, for example, take an existing parser for Java and an existing parser for HTML and combine them into a parser for JSP.
Most parser generators don't support this, or only support it in a limited way. Parser combinators OTOH only support this and nothing else.
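As a rough illustration of the three kinds of combination described above, here is a minimal Parsec sketch. The names doP, bitP, bitsP and doBitsP are made up for this example; only char, <|>, many1 and parse come from the library.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Sequence: a 'd' parser followed by an 'o' parser gives a "do" parser.
doP :: Parser String
doP = do
  d <- char 'd'
  o <- char 'o'
  return [d, o]

-- Alternation: a '0' parser or a '1' parser gives a 0|1 parser.
bitP :: Parser Char
bitP = char '0' <|> char '1'

-- Repetition: apply the bit parser one or more times.
bitsP :: Parser String
bitsP = many1 bitP

-- Parsers built from combinators combine again like any other parser.
doBitsP :: Parser (String, String)
doBitsP = do
  kw <- doP
  bs <- bitsP
  return (kw, bs)
```

Running parse doBitsP "" "do0110" should yield the pair ("do", "0110"). Nothing here is generated code: every parser is an ordinary Haskell value that can be passed around and reused.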

You might want to see this question as well as the linked one in your question.
Which Haskell parsing technology is most pleasant to use, and why?
In Haskell the competition is between Parsec (and other parser combinators) and the parser generator Happy. I'd pick Happy if I already had an LR grammar to work from: parser combinators take grammars in LL form, the translation from LR to LL takes some effort, and a combinator parser will usually be significantly slower. If I don't have a grammar I'll use Parsec; it is more flexible (powerful) than Happy, and it's more fun to work "in Haskell" than to generate code with Happy and Alex. If you use Happy for parsing you almost always need to use Alex for lexing.
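To give a feel for the LR-to-LL translation effort, here is a small sketch. A left-recursive rule such as expr : expr '-' term | term is natural in an LR grammar, but transcribed literally into combinators it would loop forever. The names number and expr are invented for this example; chainl1 is Parsec's stock helper for the usual rewrite.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- LR-style rule (fine for Happy, non-terminating as a direct combinator):
--   expr : expr '-' number | number
-- LL-friendly version: parse one number, then fold any further
-- "- number" pieces to the left with chainl1.

number :: Parser Integer
number = read <$> many1 digit

expr :: Parser Integer
expr = number `chainl1` minus
  where
    -- Parses '-' and returns the function used to combine both sides.
    minus = (-) <$ char '-'
```

With this definition, parse expr "" "10-3-2" evaluates as (10-3)-2, so left associativity survives the rewrite.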
For industry practice, it would be odd to decide to use Haskell just to get Parsec. For parsing, most of the current crop of languages will have at least a parser generator and probably something more flexible like a port of Parsec or a PEG system.
Ira Baxter's answer to the linked question was spot-on about a parser getting you merely to the foothills of the Himalayas for writing a translator, but being part of a translator is only one of the uses for a parser, so there are still many domains where fairly minimalist systems like ANTLR, Happy and Parsec are satisfactory.

Following on from stephen's answer, I think that one of the most common alternatives to Parsec, if you want to stick with parser combinators, is attoparsec. The main difference is that attoparsec was written with more of a bias towards speed, and makes trade-offs accordingly. For example, Parsec does some book-keeping to try to return helpful error messages if a parse fails, which attoparsec doesn't do to the same extent. Also, I think that attoparsec is specialised to one input stream/token type, whereas Parsec abstracts from the input type so that it can parse streams of type String, ByteString, Text, etc. without problem.

Related

Is it possible to emit raw source code with Template Haskell?

Say I have a String (or Text or whatever) containing valid Haskell code. Is there a way to convert it into a [Dec] with Template Haskell?
I'm pretty sure the AST doesn't directly go to GHC so there's going to be a printing and then a parsing stage anyways.
This would be great to have since it would allow different "backends" for TH. For example you could use the AST from haskell-src-exts which supports more Haskell syntax than TH does.

I'm pretty sure the AST doesn't directly go to GHC so there's going to be a printing and then a parsing stage anyways.
Why would you think that? That isn’t the case, the TH AST is converted to GHC’s internal AST directly; it never gets converted back to text at any point in that process. (If it did, that would be pretty strange.)
Still, it would be somewhat nice if Template Haskell exposed a way to parse Haskell source to expressions, types, and declarations, basically exposing the parsers behind various e, t, and d quoters that are built in to Template Haskell. Unfortunately, it does not, and I don’t believe there are currently any plans to change that.
Currently, you need to go through haskell-src-exts instead. This is somewhat less than ideal, since there are differences between haskell-src-exts’s parser and GHC’s, but it’s as good as you’re currently going to get. To lessen the pain, there is a package called haskell-src-meta that bridges haskell-src-exts and template-haskell.
For your use case, you can use the parseDecs function from Language.Haskell.Meta.Parse, which has the type String -> Either String [Dec], which is what you’re looking for.
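For comparison, here is a small sketch of the built-in d quoter mentioned above, using only the template-haskell package (not haskell-src-meta). It shows that a quote already is a [Dec] value, with no printing and re-parsing step. The names quotedDecs and showQuotedDecs are invented for this example.

```haskell
{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH

-- The [d| ... |] quoter yields declarations as TH AST values directly.
quotedDecs :: IO [Dec]
quotedDecs = runQ
  [d|
    double :: Int -> Int
    double x = x * 2 |]

-- Render the AST back to source text, for inspection only.
showQuotedDecs :: IO String
showQuotedDecs = pprint <$> quotedDecs
```

The catch, as noted above, is that the quoted code has to be written in the source file. For a String that only exists at run time, parseDecs from haskell-src-meta remains the route to take.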

How to build Abstract Syntax Trees from grammar specification in Haskell?

I'm working on a project which involves optimizing certain constructs in a very small subset of Java, formalized in BNF.
If I were to do this in Java, I would use a combination of JTB and JavaCC which builds an AST. Visitors are then used to manipulate the tree. But, given the vast libraries for parsing in Haskell (parsec, happy, alex etc), I'm a bit confused in choosing the appropriate library.
So, simply put, when a language is specified in BNF, which library offers the easiest means to build an AST? And what is the best way to go about modifying this tree in idiomatic Haskell?

Well, in Haskell there are two main ways of parsing something: parser combinators or a parser generator. Since you already have a BNF, I'd suggest the latter.
A good one is happy, usually paired with alex for lexing. GHC's parser IIRC is written using these, so you'd be in good company.
Next you'll have a big honking stack of data declarations to parse into:
data JavaClass = JavaClass {
    className :: Name,
    interfaces :: [Name],
    contents :: [ClassContents],
    ...
  }
data ClassContents = M Method
                   | F Field
                   | IC InnerClass
and for expressions and whatever else you need. Finally you'll combine these into something like
data TopLevel = JC JavaClass
              | WhateverOtherForms
              | YouWillParse
Once you have this you'll have the entire AST represented as one TopLevel or a list of them, depending on how many classes/files you parse.
To proceed from here depends on what you want to do. There are a number of libraries such as syb (scrap your boilerplate) that let you write very concise tree traversals and modifications. lens is also an option. At a minimum check out Data.Traversable and Data.Foldable.
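As a minimal sketch of what Data.Foldable and Data.Traversable buy you, here is a toy AST (not the Java one above) that is parameterised over its name type so the classes can be derived. The type Expr and the functions names, upperNames and resolve are invented for this example.

```haskell
{-# LANGUAGE DeriveTraversable #-}
import Data.Char (toUpper)
import Data.Foldable (toList)

-- A toy expression AST, parameterised over the type of names so that
-- Functor, Foldable and Traversable can be derived for it.
data Expr v
  = Var v
  | Lit Int
  | Add (Expr v) (Expr v)
  | Call v [Expr v]
  deriving (Show, Eq, Functor, Foldable, Traversable)

-- Foldable: collect every variable or function name, left to right.
names :: Expr v -> [v]
names = toList

-- Functor: rewrite every name in one pass.
upperNames :: Expr String -> Expr String
upperNames = fmap (map toUpper)

-- Traversable: resolve names to slots, failing if any name is unknown.
resolve :: [(String, Int)] -> Expr String -> Maybe (Expr Int)
resolve env = traverse (`lookup` env)
```

For example, names (Add (Var "x") (Call "f" [Var "y", Lit 1])) gives ["x","f","y"]. This only reaches the parameterised positions; changes to the tree's structure still need explicit recursion or a generic library.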
To modify the tree, you can do something as simple as
ignoreInnerClasses :: JavaClass -> JavaClass
ignoreInnerClasses c = c { contents = filter (not . isInnerClass) (contents c) }
  -- ^^^ that is called a record update
  where isInnerClass (IC _) = True
        isInnerClass _      = False
and then you could potentially use something like syb to write
everywhere (mkT ignoreInnerClasses) toplevel
which will traverse everything and apply ignoreInnerClasses to all JavaClasses. This is possible to do in lens and many other libraries too, but syb is very easy to read.

I've never used bnfc-meta (suggested by @phg), but I would strongly recommend you look into BNFC (on hackage: http://hackage.haskell.org/package/BNFC). The basic approach is that you write your grammar in an annotated BNF style, and it will automatically generate an AST, parser, and pretty-printer for the grammar.
How suitable BNFC is depends upon the complexity of your grammar. If it's not context-free, you'll likely have a difficult time making any progress (I did have some success hacking up context-sensitive extensions, but that code's likely bit-rotted by now). The other downside is that your AST will very directly reflect the grammar specification. But since you already have a BNF specification, adding the necessary annotations for BNFC should be rather straightforward, so it's probably the fastest way to get a usable AST. Even if you decide to go a different route, you might be able to take the generated data types as a starting point for a hand-written version.

Alex + Happy.
There are many approaches to modify/investigate the parsed terms (ASTs). The keyword to search for is "datatype-generic" programming. But beware: it is a complex topic ...
http://people.cs.uu.nl/andres/Rec/MutualRec.pdf
http://www.cs.uu.nl/wiki/GenericProgramming/Multirec
It has a generic implementation of the zipper available here:
http://hackage.haskell.org/packages/archive/zipper/0.3/doc/html/Generics-MultiRec-Zipper.html
Also check out https://github.com/pascalh/Astview

You might also check out the Haskell Compiler Series which is nice as an introduction to using alex and happy to parse a subset of Java: http://bjbell.wordpress.com/haskell-compiler-series/.

Since your grammar is expressed in BNF, it is context-free, and a grammar for a small language subset like this can usually be parsed efficiently with a shift-reduce parser (that is, it is LALR or can be made so with small changes). Such efficient parsers can be generated by the parser generator yacc/bison (C, C++), or its Haskell equivalent, "Happy".
That's why I would use "Happy" in your case. It takes grammar rules in BNF form and generates a parser from it directly. The resulting parser will accept the language that is described by your grammar rules and produce an AST (abstract syntax tree). The Happy user guide is quite nice and gets you started quickly:
http://www.haskell.org/happy/doc/html/
To transform the resulting AST, generic programming is a good idea. Here is a classical explanation on how to do this in Haskell in a practical fashion, from scratch:
http://research.microsoft.com/en-us/um/people/simonpj/papers/hmap/
I have used exactly this to build a compiler for a small domain specific language, and it was a simple and concise solution.
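To make the generic-programming idea concrete, here is a minimal sketch of the two central combinators in the style of that paper, written against Data.Data from base instead of the syb package. The signatures are simplified, and the toy Expr type and simplify function are invented for this example.

```haskell
{-# LANGUAGE DeriveDataTypeable, RankNTypes #-}
import Data.Data (Data, gmapT)
import Data.Maybe (fromMaybe)
import Data.Typeable (Typeable, cast)

-- Lift a transformation on one type to a transformation on any type:
-- it acts where the types match and is the identity everywhere else.
mkT :: (Typeable a, Typeable b) => (b -> b) -> a -> a
mkT f = fromMaybe id (cast f)

-- Apply a generic transformation to every node, bottom-up.
everywhere :: Data b => (forall a. Data a => a -> a) -> b -> b
everywhere f x = f (gmapT (everywhere f) x)

-- A toy AST to transform.
data Expr = Num Int | Add Expr Expr | Neg Expr
  deriving (Show, Eq, Data)

-- One local rewrite rule: double negation cancels out.
simplify :: Expr -> Expr
simplify (Neg (Neg e)) = e
simplify e             = e

-- The rule applied throughout a whole tree, with no traversal code.
simplifyAll :: Expr -> Expr
simplifyAll = everywhere (mkT simplify)
```

Here simplifyAll (Add (Neg (Neg (Num 1))) (Num 2)) should give Add (Num 1) (Num 2). In real code the syb package provides these combinators ready-made.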

Haskell parser combinator for Haskell identifier

I'm writing an anti-quoter in Haskell and I need a Parsec combinator that parses a valid Haskell variable identifier.
Is there one already implemented in the quasiquoting libraries or do I need to write my own?
I'm hoping I don't need to copy/paste the ident implementation found in http://www.haskell.org/haskellwiki/Quasiquotation.

It's unlikely that anything in the implementation of Template Haskell itself contains a Parsec parser for anything, because GHC does not use Parsec for parsing--note that it's not in the list of packages tied to GHC in various ways.
However, the module Text.Parsec.Token gives a means of describing full token parsers for languages, and the Text.Parsec.Language module includes some predefined token parsers, including one for Haskell tokens.
Beyond that, you could also look at the haskell-src-exts package, which is a parser for Haskell source files.
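A minimal sketch of the Text.Parsec.Token route follows. It takes the identifier parser out of the predefined haskell token parser and then adds a lowercase check, on the assumption (from reading the language definition) that the stock parser also accepts capitalised names. The names haskellIdent and varIdent are invented for this example.

```haskell
import Data.Char (isLower)
import Text.Parsec (parse, try, unexpected)
import Text.Parsec.Language (haskell)
import Text.Parsec.String (Parser)
import qualified Text.Parsec.Token as Tok

-- Identifier parser from the predefined Haskell token parser.
-- It rejects reserved words and skips trailing whitespace.
haskellIdent :: Parser String
haskellIdent = Tok.identifier haskell

-- Restrict to variable identifiers, i.e. names starting in lowercase.
varIdent :: Parser String
varIdent = try $ do
  name <- haskellIdent
  case name of
    (c:_) | isLower c -> return name
    _                 -> unexpected ("non-variable identifier " ++ show name)
```

Note that this is only an approximation of the Haskell lexical rules: for instance, names beginning with an underscore are not accepted by this sketch.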

Is there any way to make parsec report "shift-reduce" conflicts?

I'm playing around with parsec and realized that I had an ambiguous grammar. Obviously that's an error on my part, but I'm sort of used to yacc-style parser generators letting me know I'm being dumb. Parsec just eats characters in the order you give it parsers (yeah, I know about try).
Is there any way to make parsec tell me when my grammar isn't left-factored? Programs that do work for me are great.
Thanks!
(I know that shift-reduce has to do with a different kind of parser technology. I simply mean to describe ambiguous grammars.)

I am not a Parsec expert, so I'm likely to be corrected, but I don't think this is possible, for the simple reason that Parsec knows nothing about your grammar.
Or put another way, while your grammar may be ambiguous, your Parsec parser is not, and a program has no way of determining that some other arrangement of parsec combinators, which produces a different output for equivalent input, is also a valid representation of an unspecified grammar.
Since you do have a grammar, you might prefer to use happy and alex, which will give you a much more lex/yacc-like experience.
An interesting project might be to adapt the BNFC to produce an AST of parsec combinators to represent a grammar, but I suspect this would be a non-trivial task.
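To illustrate why there is nothing for Parsec to warn about, here is a small sketch of how the same two alternatives behave in different arrangements. Each arrangement is a perfectly deterministic parser; they simply accept different things. The names shortFirst, longFirst and longFirstTry are invented for this example.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Ordered choice: the first alternative that succeeds wins, so on the
-- input "done" only the prefix "do" is consumed.
shortFirst :: Parser String
shortFirst = string "do" <|> string "done"

-- Longest first, no try: on input "do", string "done" consumes "do"
-- and then fails; <|> does not backtrack over consumed input.
longFirst :: Parser String
longFirst = string "done" <|> string "do"

-- try rewinds the input on failure, so the second alternative can run.
longFirstTry :: Parser String
longFirstTry = try (string "done") <|> string "do"
```

A yacc-style tool would flag the shared prefix when it builds its tables. Parsec never sees a grammar, only these functions, so the usual way to find such problems is testing.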

Write a Haskell interpreter in Haskell

A classic programming exercise is to write a Lisp/Scheme interpreter in Lisp/Scheme. The power of the full language can be leveraged to produce an interpreter for a subset of the language.
Is there a similar exercise for Haskell? I'd like to implement a subset of Haskell using Haskell as the engine. Of course it can be done, but are there any online resources available to look at?
Here's the backstory.
I am exploring the idea of using Haskell as a language to explore some of the concepts in a Discrete Structures course I am teaching. For this semester I have settled on Miranda, a smaller language that inspired Haskell. Miranda does about 90% of what I'd like it to do, but Haskell does about 2000%. :)
So my idea is to create a language that has exactly the features of Haskell that I'd like and disallows everything else. As the students progress, I can selectively "turn on" various features once they've mastered the basics.
Pedagogical "language levels" have been used successfully to teach Java and Scheme. By limiting what they can do, you can prevent them from shooting themselves in the foot while they are still mastering the syntax and concepts you are trying to teach. And you can offer better error messages.

I love your goal, but it's a big job. A couple of hints:
I've worked on GHC, and you don't want any part of the sources. Hugs is a much simpler, cleaner implementation but unfortunately it's in C.
It's a small piece of the puzzle, but Mark Jones wrote a beautiful paper called Typing Haskell in Haskell which would be a great starting point for your front end.
Good luck! Identifying language levels for Haskell, with supporting evidence from the classroom, would be of great benefit to the community and definitely a publishable result!

There is a complete Haskell parser: http://hackage.haskell.org/package/haskell-src-exts
Once you've parsed it, stripping out or disallowing certain things is easy. I did this for tryhaskell.org to disallow import statements, to support top-level definitions, etc.
Just parse the module:
parseModule :: String -> ParseResult Module
Then you have an AST for a module:
Module SrcLoc ModuleName [ModulePragma] (Maybe WarningText) (Maybe [ExportSpec]) [ImportDecl] [Decl]
The Decl type is extensive: http://hackage.haskell.org/packages/archive/haskell-src-exts/1.9.0/doc/html/Language-Haskell-Exts-Syntax.html#t%3ADecl
All you need to do is define a white-list -- of what declarations, imports, symbols, syntax is available, then walk the AST and throw a "parse error" on anything you don't want them to be aware of yet. You can use the SrcLoc value attached to every node in the AST:
data SrcLoc = SrcLoc
  { srcFilename :: String
  , srcLine :: Int
  , srcColumn :: Int
  }
There's no need to re-implement Haskell. If you want to provide more friendly compile errors, just parse the code, filter it, send it to the compiler, and parse the compiler output. If it's a "couldn't match expected type a against inferred a -> b" then you know it's probably too few arguments to a function.
Unless you really really want to spend time implementing Haskell from scratch or messing with the internals of Hugs, or some dumb implementation, I think you should just filter what gets passed to GHC. That way, if your students want to take their code-base and take it to the next step and write some real fully fledged Haskell code, the transition is transparent.
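The white-list walk described above might look roughly like the following sketch. To keep it self-contained it does not use haskell-src-exts: SrcLoc mirrors the record shown above, while Decl, Level, requiredLevel and checkDecls are hypothetical, heavily simplified stand-ins for the real AST and for whatever levels a course defines.

```haskell
-- Hypothetical, simplified stand-ins for the haskell-src-exts AST types.
data SrcLoc = SrcLoc
  { srcFilename :: String
  , srcLine     :: Int
  , srcColumn   :: Int
  } deriving (Show, Eq)

data Decl
  = FunBind   SrcLoc String   -- function definition
  | TypeSig   SrcLoc String   -- type signature
  | ClassDecl SrcLoc String   -- type class declaration
  deriving (Show, Eq)

-- Course levels (assumed); later levels unlock more declaration forms.
data Level = Basics | Signatures | TypeClasses
  deriving (Show, Eq, Ord)

-- The level at which each kind of declaration becomes available.
requiredLevel :: Decl -> Level
requiredLevel FunBind{}   = Basics
requiredLevel TypeSig{}   = Signatures
requiredLevel ClassDecl{} = TypeClasses

declLoc :: Decl -> SrcLoc
declLoc (FunBind l _)   = l
declLoc (TypeSig l _)   = l
declLoc (ClassDecl l _) = l

-- Walk the declarations and report the first one not yet unlocked,
-- using its source location to build a "parse error" message.
checkDecls :: Level -> [Decl] -> Either String [Decl]
checkDecls level = traverse check
  where
    check d
      | requiredLevel d <= level = Right d
      | otherwise = Left (formatLoc (declLoc d)
                          ++ ": parse error (feature not enabled yet)")
    formatLoc (SrcLoc f l c) = f ++ ":" ++ show l ++ ":" ++ show c
```

A module that passes the check can then be handed to GHC unchanged, which is what keeps the transition to full Haskell transparent.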

Do you want to build your interpreter from scratch? Begin with implementing an easier functional language like the lambda calculus or a lisp variant. For the latter there is a quite nice wikibook called Write yourself a Scheme in 48 hours giving a cool and pragmatic introduction into parsing and interpretation techniques.
Interpreting Haskell by hand will be much more complex since you'll have to deal with highly complex features like typeclasses, an extremely powerful type system (type-inference!) and lazy-evaluation (reduction techniques).
So you should define a quite little subset of Haskell to work with and then maybe start by extending the Scheme-example step by step.
Addition:
Note that in Haskell, you have full access to the interpreter's API (at least under GHC), including parsers, compilers and of course interpreters.
The package to use is hint (Language.Haskell.*). I have unfortunately neither found online tutorials on this nor tried it out myself, but it looks quite promising.

create a language that has exactly the features of Haskell that I'd like and disallows everything else. As the students progress, I can selectively "turn on" various features once they've mastered the basics.
I suggest a simpler (as in less work involved) solution to this problem. Instead of creating a Haskell implementation where you can turn features off, wrap a Haskell compiler with a program that first checks that the code doesn't use any feature you disallow, and then uses the ready-made compiler to compile it.
That would be similar to HLint (and also kind of its opposite):
HLint (formerly Dr. Haskell) reads Haskell programs and suggests changes that hopefully make them easier to read. HLint also makes it easy to disable unwanted suggestions, and to add your own custom suggestions.
Implement your own HLint "suggestions" to not use the features you don't allow
Disable all the standard HLint suggestions.
Make your wrapper run your modified HLint as a first step
Treat HLint suggestions as errors. That is, if HLint "complained" then the program doesn't proceed to compilation stage

Baskell is a teaching implementation, http://hackage.haskell.org/package/baskell
You might start by picking just, say, the type system to implement. That's about as complicated as an interpreter for Scheme, http://hackage.haskell.org/package/thih

The EHC series of compilers is probably the best bet: it's actively developed and seems to be exactly what you want - a series of small lambda calculi compilers/interpreters culminating in Haskell '98.
But you could also look at the various languages developed in Pierce's Types and Programming Languages, or the Helium interpreter (a crippled Haskell intended for students http://en.wikipedia.org/wiki/Helium_(Haskell)).

If you're looking for a subset of Haskell that's easy to implement, you can do away with type classes and type checking. Without type classes, you don't need type inference to evaluate Haskell code.
I wrote a self-compiling Haskell subset compiler for a Code Golf challenge. It takes Haskell subset code on input and produces C code on output. I'm sorry there isn't a more readable version available; I lifted nested definitions by hand in the process of making it self-compiling.
For a student interested in implementing an interpreter for a subset of Haskell, I would recommend starting with the following features:
Lazy evaluation. If the interpreter is in Haskell, you might not have to do anything for this.
Function definitions with pattern-matched arguments and guards. Only worry about variable, cons, nil, and _ patterns.
Simple expression syntax:
Integer literals
Character literals
[] (nil)
Function application (left associative)
Infix : (cons, right associative)
Parentheses
Variable names
Function names
More concretely, write an interpreter that can run this:
-- tail :: [a] -> [a]
tail (_:xs) = xs
-- append :: [a] -> [a] -> [a]
append [] ys = ys
append (x:xs) ys = x : append xs ys
-- zipWith :: (a -> b -> c) -> [a] -> [b] -> [c]
zipWith f (a:as) (b:bs) = f a b : zipWith f as bs
zipWith _ _ _ = []
-- showList :: (a -> String) -> [a] -> String
showList _ [] = '[' : ']' : []
showList show (x:xs) = '[' : append (show x) (showItems show xs)
-- showItems :: (a -> String) -> [a] -> String
showItems show [] = ']' : []
showItems show (x:xs) = ',' : append (show x) (showItems show xs)
-- fibs :: [Int]
fibs = 0 : 1 : zipWith add fibs (tail fibs)
-- main :: String
main = showList showInt (take 40 fibs)
Type checking is a crucial feature of Haskell. However, going from nothing to a type-checking Haskell compiler is very difficult. If you start by writing an interpreter for the above, adding type checking to it should be less daunting.
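As a starting point, here is a minimal sketch of an evaluator for a fragment of that feature list: integer literals, variables, nil, cons, lambda and application. Pattern matching, guards and parsing are left out, and all names (Expr, Value, eval, evalProgram, takeInts) are invented for this example. It mainly shows the point about lazy evaluation: the host language supplies it for free.

```haskell
import qualified Data.Map as M

-- Expressions of the toy language.
data Expr
  = Lit Integer
  | Var String
  | Nil
  | Cons Expr Expr
  | Lam String Expr
  | App Expr Expr

-- Values; the fields of VCons are lazy, like Haskell's own lists.
data Value
  = VInt Integer
  | VNil
  | VCons Value Value
  | VFun (Value -> Value)

type Env = M.Map String Value

eval :: Env -> Expr -> Value
eval _   (Lit n)    = VInt n
eval env (Var x)    = case M.lookup x env of
  Just v  -> v
  Nothing -> error ("unbound variable: " ++ x)
eval _   Nil        = VNil
eval env (Cons h t) = VCons (eval env h) (eval env t)
eval env (Lam x b)  = VFun (\v -> eval (M.insert x v env) b)
eval env (App f a)  = case eval env f of
  VFun g -> g (eval env a)   -- the argument stays an unevaluated thunk
  _      -> error "applied a non-function"

-- Top-level definitions may refer to each other and to themselves:
-- the environment is defined in terms of itself.
evalProgram :: [(String, Expr)] -> Env
evalProgram defs = env
  where env = M.fromList [ (name, eval env body) | (name, body) <- defs ]

-- Observe a finite prefix of a (possibly infinite) list of integers.
takeInts :: Int -> Value -> [Integer]
takeInts n (VCons (VInt h) t) | n > 0 = h : takeInts (n - 1) t
takeInts _ _ = []
```

With the single definition ones = 1 : ones, written as ("ones", Cons (Lit 1) (Var "ones")), takeInts 3 applied to the value of ones returns [1,1,1] instead of looping.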

You might look at Happy (a yacc-like parser in Haskell) which has a Haskell parser.

This might be a good idea - make a tiny version of NetLogo in Haskell. Here is the tiny interpreter.

See if Helium would make a better base to build upon than standard Haskell.

Uhc/Ehc is a series of compilers enabling/disabling various Haskell features.
http://www.cs.uu.nl/wiki/Ehc/WebHome#What_is_UHC_And_EHC

I've been told that Idris has a fairly compact parser, not sure if it's really suitable for alteration, but it's written in Haskell.

Andrej Bauer's Programming Language Zoo has a small implementation of a purely functional programming language somewhat cheekily named "minihaskell". It is about 700 lines of OCaml, so very easy to digest.
The site also contains toy versions of ML-style, Prolog-style and OO programming languages.

Don't you think it would be easier to take the GHC sources and strip out what you don't want, than it would be to write your own Haskell interpreter from scratch? Generally speaking, there should be a lot less effort involved in removing features as opposed to creating/adding features.
GHC is written in Haskell anyway, so technically that stays with your question of a Haskell interpreter written in Haskell.
It probably wouldn't be too hard to make the whole thing statically linked and then only distribute your customized GHCi, so that the students can't load other Haskell source modules. As to how much work it would take to prevent them from loading other Haskell object files, I have no idea. You might want to disable FFI too, if you have a bunch of cheaters in your classes :)

The reason why there are so many LISP interpreters is that LISP is basically a predecessor of JSON: a simple format to encode data. This makes the frontend part quite easy to handle. Compared to that, Haskell, especially with Language Extensions, is not the easiest language to parse.
These are some syntactical constructs that sound tricky to get right:
operators with configurable precedence, associativity, and fixity
nested comments
layout rule
pattern syntax
do-blocks and desugaring to monadic code
Each of these, except maybe the operators, could be tackled by students after their Compiler Construction Course, but it would take the focus away from how Haskell actually works. In addition to that, you might not want to implement all syntactical constructs of Haskell directly, but instead implement passes to get rid of them. Which brings us to the literal core of the issue, pun fully intended.
My suggestion is to implement typechecking and an interpreter for Core instead of full Haskell. Both of these tasks are quite intricate by themselves already.
This language, while still a strongly typed functional language, is way less complicated to deal with in terms of optimization and code generation.
However, it is still independent from the underlying machine.
Therefore, GHC uses it as an intermediary language and translates most syntactic constructs of Haskell into it.
Additionally, you should not shy away from using GHC's (or another compiler's) frontend.
I'd not consider that as cheating since custom LISPs use the host LISP system's parser (at least during bootstrapping). Cleaning up Core snippets and presenting them to students, along with the original code, should allow you to give an overview of what the frontend does, and why it is preferable to not reimplement it.
Here are a few links to the documentation of Core as used in GHC:
System FC: equality constraints and coercions
GHC/As a library
The Core type
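As a small illustration of a pass that gets rid of a syntactic construct, here is a sketch that desugars do-blocks into applications of >>= and >>. The Expr and Stmt types are toy ones invented for this example (not GHC's Core), and only variable patterns are handled, so the fail case for refutable patterns is ignored.

```haskell
-- A tiny expression language: the target of the desugaring.
data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  deriving (Eq, Show)

-- Statements of a do-block.
data Stmt
  = Bind String Expr   -- x <- e
  | Exec Expr          -- e
  deriving (Eq, Show)

-- do { e }           =  e
-- do { e; rest }     =  e >> do { rest }
-- do { x <- e; rest } =  e >>= \x -> do { rest }
-- An empty block, or one that ends in a bind, is rejected.
desugarDo :: [Stmt] -> Maybe Expr
desugarDo [Exec e] = Just e
desugarDo (Exec e : rest) = do
  k <- desugarDo rest
  Just (App (App (Var ">>") e) k)
desugarDo (Bind x e : rest) = do
  k <- desugarDo rest
  Just (App (App (Var ">>=") e) (Lam x k))
desugarDo _ = Nothing
```

After such a pass, the evaluator and type checker only need to understand variables, application and lambda, which is the same economy that Core gives GHC.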
