ANTLR4 Parser Rule granularity

Say I have a grammar that has a parser rule to match one of several specific strings.
Is the proper thing to do in the grammar to make an alternate parser rule for each specific string, or to keep the parser rule general and decode the string in a visitor subclass?

If the specific strings are meaningful (e.g. a keyword in a DSL) it sounds like you want Tokens. Whatever rules you have in the grammar can reference the Tokens you created.
Generally, it's better to have your grammar do as much of the parser work as possible, rather than overly generalizing and having to write a bunch of extra code.
See the following: http://www.antlr.org/wiki/display/ANTLR4/Grammar+Structure
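As a sketch of that token-based approach (the DSL keywords below are made up for illustration, not taken from the question), each meaningful string gets its own lexer token and its own labeled alternative, so the generated visitor offers one method per case instead of forcing you to decode a generic string:
grammar Commands;

// One labeled alternative per keyword: the generated visitor gets
// visitStartCommand, visitStopCommand and visitRestartCommand.
command
    : START   # StartCommand
    | STOP    # StopCommand
    | RESTART # RestartCommand
    ;

// Explicit tokens for the meaningful strings of the DSL.
START   : 'start' ;
STOP    : 'stop' ;
RESTART : 'restart' ;

WS : [ \t\r\n]+ -> skip ;
With explicit tokens, the same keywords can also be referenced from any other rule without repeating the literal.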

Related

Antlr4 Lexer with two continuous parentheses

My text is: Function(argument, function(argument)). When I use the g4 debugger to generate the tree, it works when there is a blank between the last two parentheses: Function(argument, function(argument) ). But the debugger reports unexpected '))' when there is no blank. So, how should I revise my grammar to make this work?
It confuses me a lot.
(It would be much easier to confirm the answer to your question if you posted the grammar, or at least a stripped-down version that demonstrates the issue. Note: often, reducing the problem to a minimal example results in finding the answer yourself.)
That said, based upon your error message, I would guess that you have a Lexer rule that matches )). If you do, then ANTLR will match that rule and create that token rather than creating two ) tokens.
Most of the time this mistake originates from not understanding that the Lexer is completely independent of the parser. The Lexer recognizes streams of characters and produces a stream of tokens, and then the parser matches rules against that stream of tokens. In this case, I would guess that the parser rule is looking to match a ) token, but finds the )) token instead. Those are two different tokens, so the parser rule will fail.
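As an illustration (the rule names are guesses, since the original grammar was not posted), a lexer rule like the first one below glues both closing parentheses into a single token, which no parser rule expecting ) can match:
// Guessed culprit: this rule turns the input '))' into one token.
RPAREN_PAIR : '))' ;

// With only the single-character rule below, '))' is lexed as two
// separate ')' tokens, which is what the parser rule expects.
RPAREN : ')' ;
Removing the two-character rule (or whatever rule happens to match '))' in your grammar) lets the parser receive the two separate ) tokens it is looking for.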

Is Java Antlr4 grammar somehow finished?

I noticed that additiveExpression is never used here: https://github.com/antlr/grammars-v4/blob/master/java/java8/Java8Parser.g4#L1276
(Used only in itself and in shiftExpression)
Is it by design, or is the Java grammar far from complete?
I cannot answer whether the Java ANTLR4 grammar is finished (when is a grammar ever finished?) and #kaby76 gave you some information, but I can explain why the additiveExpression rule is used only in shiftExpression and nowhere else. This is how complex expressions are typically defined in ANTLR4 grammars: you start with a top-level rule and (depending on the precedence) define sub-rules that handle smaller parts of the expression, thereby distributing the actual work over multiple sub-rules that together handle the entire expression. The rule additiveExpression is such a sub-rule, responsible for addition and subtraction, with both operands parsed by a further sub-rule (here multiplicativeExpression).
There's no other situation where this sub-rule would be needed; it makes sense only in expression parsing.
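A simplified sketch of that chain (abridged and paraphrased, not copied verbatim from Java8Parser.g4) shows how each precedence level references only the next level down, which is why additiveExpression appears nowhere except in shiftExpression:
shiftExpression
    : additiveExpression
    | shiftExpression '<<' additiveExpression
    | shiftExpression '>>' additiveExpression
    ;

additiveExpression
    : multiplicativeExpression
    | additiveExpression '+' multiplicativeExpression
    | additiveExpression '-' multiplicativeExpression
    ;

multiplicativeExpression
    : unaryExpression
    | multiplicativeExpression '*' unaryExpression
    | multiplicativeExpression '/' unaryExpression
    ;
Parsing starts at the top-level expression rule and works down to shiftExpression; an input such as a + b * 2 then falls through into additiveExpression, whose operands are in turn handled by multiplicativeExpression.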

Repeating Pattern Matching in antlr4

I'm trying to write a lexer rule that would match the following strings:
a
aa
aaa
bbbb
The requirement here is that all characters must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can't do that in an ANTLR lexer. At least, not without target-specific code inside your grammar. And placing code in your grammar is something you should not do (it makes it hard to read, and it ties the grammar to that target language). It is better to do that kind of check/validation inside a listener or visitor.
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular expression syntax available in ANTLR (and in all parser generators I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
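A minimal sketch of that split (keeping the REPEAT_CHARS name from the question; the validation itself lives in your target language, so it is only described in the comment): let the lexer accept any run of lowercase letters and postpone the "all characters identical" check to a listener or visitor:
// Over-approximate in the lexer: any run of lowercase letters.
// Whether all characters are identical is checked afterwards in a
// listener/visitor by inspecting the token text (e.g. getText() in the Java target).
REPEAT_CHARS : [a-z]+ ;
A string such as aaa is then accepted directly, while something like aab still produces a REPEAT_CHARS token but gets rejected by your own check on the parse tree.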

Why is antlr4 c grammar parser rule "typeSpecifier" not using lexer rule "Double"?

I am using the ANTLR4 C grammar as inspiration for my own grammar. I came across one thing I don't really get. Why are there lexer rules for datatypes when they are not used? For example, the rule Double : 'double'; is never used, but the parser rule typeSpecifier:('double' | ... ); (other datatypes have been removed to simplify) is used in several places. Is there a reason why the parser rule typeSpecifier is not using the lexer rule Double?
All the grammars on that page are volunteer submissions and not part of ANTLR4. It's clearly a mistake, but given the way lexer rules are matched, it won't make a difference in lexing. You can choose to implement either the explicit rule:
Double : 'double';
or the implicit one:
typeSpecifier
    : ('void'
    | 'char'
    | 'short'
    | 'int'
    | 'long'
    | 'float'
    | 'double'
    // ... remaining alternatives omitted
    ) ;
with no ill effects either way, even if you mix methods. In fact, if you take a more global look at that whole grammar, the author did the same thing with numerous other lexer rules, like Register for example. Makes no difference in actual practice.
Bottom line? Choose whichever method you like and apply it consistently. My personal preference is toward brevity, so I like the implicit tokens so long as they are used in only one place in the grammar. As soon as a token might be used in two places, I prefer to make an explicit token out of it and update the two or more locations where it's used.
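As a small, hypothetical illustration of that preference (the rule names below are invented, not taken from the C grammar): once a literal such as 'double' is needed by more than one parser rule, promoting it to an explicit token keeps every use pointing at the same definition:
// 'double' is needed in two parser rules, so it gets an explicit token.
DOUBLE : 'double' ;

typeSpecifier : DOUBLE | 'int' | 'float' ;
castTarget    : DOUBLE ;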

Should I use a lexer when using a parser combinator library like Parsec?

When writing a parser in a parser combinator library like Haskell's Parsec, you usually have 2 choices:
Write a lexer to split your String input into tokens, then perform parsing on [Token]
Directly write parser combinators on String
The first method often seems to make sense given that many parsing inputs can be understood as tokens separated by whitespace.
In other places, I have seen people recommend against tokenizing (or scanning or lexing, as some call it), with simplicity being cited as the main reason.
What are general trade-offs between lexing and not doing it?
The most important difference is that lexing will translate your input domain.
A nice result of this is that
You do not have to think about whitespace anymore. In a direct (non-lexing) parser, you have to sprinkle space parsers in all places where whitespace is allowed to be, which is easy to forget and it clutters your code if whitespace must separate all your tokens anyway.
You can think about your input in a piece-by-piece manner, which is easy for humans.
However, if you do perform lexing, you get the problems that
You cannot use common parsers on String anymore - e.g. for parsing a number with a library function parseFloat :: Parsec String s Float (that operates on a String input stream), you have to do something like takeNextToken :: TokenParser String and execute the parseFloat parser on it, inspecting the parse result (usually Either ErrorMessage a). This is messy to write and limits composability.
You have to adjust all error messages. If your parser on tokens fails at the 20th token, where in the input string is that? You'll have to manually map error locations back to the input string, which is tedious (in Parsec this means adjusting all SourcePos values).
Error reporting is generally worse. Running string "hello" *> space *> float on wrong input like "hello4" will tell you precisely that the expected whitespace is missing after the hello, while a lexer will just claim to have found an "invalid token".
Many things that one would expect to be atomic units, cleanly separable by a lexer, are actually rather too hard for a lexer to identify. Take string literals, for example - suddenly "hello world" is not the two tokens "hello and world" anymore (but only, of course, if quotes are not escaped, like \") - while this is very natural for a parser, it means complicated rules and special cases for a lexer.
You cannot re-use parsers on tokens as nicely. If you define how to parse a double out of a String, export it and the rest of the world can use it; they cannot run your (specialized) tokenizer first.
You are stuck with it. When you are developing the language to parse, using a lexer might lead you into making early decisions, fixing things that you might want to change afterwards. For example, imagine you defined a language that contains some Float token. At some point, you want to introduce negative literals (-3.4 and - 3.4) - this might not be possible due to the lexer interpreting whitespace as token separator. Using a parser-only approach, you can stay more flexible, making changes to your language easier. This is not really surprising since a parser is a more complex tool that inherently encodes rules.
To summarize, I would recommend writing lexer-free parsers for most cases.
In the end, a lexer is just a "dumbed-down"* parser - if you need a parser anyway, combine them into one.
* From computing theory, we know that all regular languages are also context-free languages; lexers are usually regular, parsers context-free or even context-sensitive (monadic parsers like Parsec can express context-sensitivity).
