Is the Java ANTLR4 grammar finished?

I noticed that additiveExpression is never used here: https://github.com/antlr/grammars-v4/blob/master/java/java8/Java8Parser.g4#L1276
(It is used only in itself and in shiftExpression.)
Is this by design, or is the Java grammar far from complete?

I cannot say whether the Java ANTLR4 grammar is finished (when is a grammar ever finished?), and @kaby76 gave you some information, but I can explain why the additiveExpression rule is used only in shiftExpression and nowhere else. This is how complex expressions are defined in ANTLR4 grammars. You start with a top-level rule and, depending on operator precedence, define sub rules that each handle a smaller part of an expression. The work is distributed over multiple sub rules that together handle the entire expression. The rule additiveExpression is such a sub rule, responsible for the addition or subtraction of two further sub rules (here multiplicativeExpression).
There's no other situation where this sub rule would be needed. It makes sense only in expression parsing.
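The layering described above can be sketched with a small hand-written recursive-descent parser. This is only an illustration, not the code ANTLR generates for Java8Parser.g4: the names `tokenize`, `Parser`, and `parse` are made up for this example, only two precedence levels are shown, and the left-recursive alternatives of the real grammar are replaced by loops.

```python
import re

def tokenize(text):
    # Integers and single-character operators/parentheses; whitespace is skipped.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def additive(self):
        # Lower precedence: combines two multiplicative sub-expressions.
        left = self.multiplicative()
        while self.peek() in ("+", "-"):
            op = self.next()
            left = (op, left, self.multiplicative())
        return left

    def multiplicative(self):
        # Higher precedence: only ever called from additive().
        left = self.primary()
        while self.peek() in ("*", "/"):
            op = self.next()
            left = (op, left, self.primary())
        return left

    def primary(self):
        tok = self.next()
        if tok == "(":
            inner = self.additive()  # parentheses restart at the top level
            self.next()              # consume ")"
            return inner
        return int(tok)

def parse(text):
    return Parser(tokenize(text)).additive()
```

Here `multiplicative` is referenced only by `additive` (and by itself through the loop), just as additiveExpression is referenced only by shiftExpression in the grammar. Each level is called from exactly one place: the level directly above it in precedence.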

Related

Antlr4 Lexer with two continuous parentheses

My text is: Function(argument, function(argument)). When I use the g4 debugger to generate the tree, it works when there is a blank between the last two parentheses: Function(argument, function(argument) ). But the debugger reports unexpected '))' when there is no blank. How should I revise my grammar to make this work?
It confuses me a lot.
(It will be much easier to confirm the answer to your question if you post the grammar, or at least a stripped down version that demonstrates the issue. Note: Often, stripping down to a minimal example to demonstrate the issue will result in you finding your own answer.)
That said, based upon your error message, I would guess that you have a Lexer rule that matches )). If you do, then ANTLR will match that rule and create that token rather than creating two ) tokens.
Most of the time this mistake originates from not understanding that the lexer is completely independent of the parser. The lexer recognizes streams of characters and produces a stream of tokens, and then the parser matches rules against that stream of tokens. In this case, I would guess that the parser rule is looking to match a ) token, but finds the )) token instead. Those are two different tokens, so the parser rule will fail.
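The effect can be reproduced with a toy longest-match tokenizer. This is a sketch of the general principle only, not ANTLR's lexer: the function `lex`, the rule list, and in particular the `END` rule matching `))` are assumptions standing in for the grammar that was not posted.

```python
import re

# Hypothetical rule set; END stands in for whatever rule matches "))".
RULES = [
    ("ID", r"[A-Za-z_]+"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("COMMA", r","),
    ("END", r"\)\)"),
]

def lex(text, rules=RULES):
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_text = None, ""
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # The longest match wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > len(best_text):
                best_name, best_text = name, m.group()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, best_text))
        pos += len(best_text)
    return tokens
```

With these rules, `lex("Function(argument, function(argument))")` ends in a single `END` token, while the version with a blank between the parentheses ends in two `RPAREN` tokens. The tokenizer never asks what the parser would like to see next, which is why the blank changes the outcome.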

Repeating Pattern Matching in antlr4

I'm trying to write a lexer rule that would match following strings
a
aa
aaa
bbbb
The requirement is that all characters in the string must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can't do that in an ANTLR lexer. At least, not without target-specific code inside your grammar. And placing code in your grammar is something you should avoid (it makes the grammar hard to read, and it ties the grammar to that target language). It is better to do that kind of check or validation inside a listener or visitor.
Things like back-references and look-arounds are features that crept into the regex engines of programming languages. The regular expression syntax available in ANTLR (and in all parser generators I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
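A minimal sketch of the check suggested above, assuming the lexer uses a general rule such as `[a-z]+` and the text of each matched token is validated afterwards, for example from a listener or visitor. The function name `is_repeated_char` is made up for this example, and the ANTLR runtime is left out so the snippet stands alone.

```python
def is_repeated_char(text):
    # True only for non-empty strings whose characters are all identical.
    return len(text) > 0 and len(set(text)) == 1

def invalid_tokens(token_texts):
    # Return the token texts that fail the check, so the caller can report them.
    return [t for t in token_texts if not is_repeated_char(t)]
```

For the strings in the question, `invalid_tokens(["a", "aa", "aaa", "bbbb"])` returns an empty list, while a token such as `"ab"` would be reported. Keeping the rule in ordinary code leaves the grammar free of target-specific actions.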

Why is antlr4 c grammar parser rule "typeSpecifier" not using lexer rule "Double"?

I am using the ANTLR4 C grammar as inspiration for my own grammar. I came across one thing I don't really get. Why are there lexer rules for data types when they are not used? For example, the rule Double : 'double'; is never used, but the parser rule typeSpecifier:('double' | ... ); (other data types have been removed to simplify) is used in several places. Is there a reason why the parser rule typeSpecifier is not using the lexer rule Double?
All the grammars on that page are volunteer submissions and not part of ANTLR4. It's clearly a mistake, but because of the way lexer rules are matched, it won't make a difference in lexing. You can choose to implement either the explicit rule:
Double : 'double';
or the implicit one:
typeSpecifier
: ('void'
| 'char'
| 'short'
| 'int'
| 'long'
| 'float'
| 'double'
with no ill effects either way, even if you mix methods. In fact, if you take a more global look at that whole grammar, the author did the same thing with numerous other lexer rules, like Register for example. Makes no difference in actual practice.
Bottom line? Choose whichever method you like and apply it consistently. My personal preference is toward brevity, so I like the implicit tokens as long as they are used in only one place in the grammar. As soon as a token might be used in two places, I prefer to make an explicit token out of it and update the two or more locations where it's used.

Semantic lexer predicate performance

I have a lexer that creates MACRO tokens for a dynamic list of macro strings passed to the lexer. I used a semantic predicate in the topmost lexer rule to implement this feature:
MACRO: { macros != null && tryMacro() }? .;
Where tryMacro() just checks if any macro string matches the input sequence.
The performance of this approach was very bad and after some research I tried changing the lexer rule to the following:
MACRO: . { macros != null && tryMacro() }?;
This dramatically improved the performance, but I don't really understand why. :) Since the '.' matches any character, the semantic predicate should be invoked exactly as many times as before, shouldn't it? Can someone provide an explanation for this behavior?
The reason is pretty simple: if you put the predicate at the start, the lexer will evaluate it to decide if the MACRO rule should apply. If you put it at the end, it will only perform the check when it has a potential match for the MACRO rule.
Since MACRO is very generic, I suppose you put it at the end of the rules, and due to the priority rules it will surely get tried last. It can match only single-character tokens, so more specific rules will take priority.
If the MACRO rule is superseded by a higher-priority rule, it won't be considered and your predicate won't be invoked.
I debugged this a bit further, and it turned out that reordering the rule changed the behavior of the lexer, causing macros to not be accepted during parsing. The reason for the perceived increase in performance was that the semantic predicate was evaluated only a couple of times before the lexer dropped the rule while doing its predictions. So the change to the rule was actually invalid and not a performance improvement.
I finally solved the performance issue by moving the macro handling to the parser.

ANTLR4 Parser Rule granularity

Say I have a grammar that has a parser rule to match one of several specific strings.
Is the proper thing to do in the grammar to make an alternate parser rule for each specific string, or to keep the parser rule general and decode the string in a visitor subclass?
If the specific strings are meaningful (e.g. a keyword in a DSL) it sounds like you want Tokens. Whatever rules you have in the grammar can reference the Tokens you created.
Generally, it's better to have your grammar do as much of the parser work as possible, rather than overly generalizing and having to write a bunch of extra code.
See the following: http://www.antlr.org/wiki/display/ANTLR4/Grammar+Structure
