ANTLR4 lexer with two consecutive parentheses

My input is: Function(argument, function(argument)). When I use the g4 debugger to generate the tree, it works when there is a blank between the last two parentheses: Function(argument, function(argument) ). But the debugger reports an unexpected '))' when there is no blank. How should I revise my grammar to make this work?
It confuses me a lot.

(It will be much easier to confirm the answer to your question if you post the grammar, or at least a stripped-down version that demonstrates the issue. Note: often, stripping a problem down to a minimal example will result in you finding your own answer.)
That said, based upon your error message, I would guess that you have a Lexer rule that matches )). If you do, then ANTLR will match that rule and create that token rather than creating two ) tokens.
Most of the time this mistake comes from not understanding that the lexer is completely independent of the parser. The lexer recognizes streams of characters and produces a stream of tokens, and then the parser matches rules against that stream of tokens. In this case, I would guess that a parser rule is looking to match a ) token but finds the )) token instead. Those are two different tokens, so the parser rule fails.
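As an illustration, here is a minimal hypothetical grammar that reproduces the symptom (the rule names are invented for this sketch, not taken from the question):

grammar Calls;

call : ID '(' arg (',' arg)* ')' ;
arg  : ID | call ;

DOUBLE_RPAREN : '))' ;               // the culprit: wins longest-match on "))"
ID            : [a-zA-Z]+ ;
WS            : [ \t\r\n]+ -> skip ;

On the input Function(argument, function(argument)), the lexer's longest-match rule hands the final )) to DOUBLE_RPAREN as a single token, and call fails. With a blank between them, each ) is lexed separately and the parse succeeds. Deleting DOUBLE_RPAREN fixes both inputs.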

Related

antlr4 handling incomplete rule match because of parse error in visitor

I'm new to antlr4, and I'm trying to make good use of ANTLR's ability to recover from parser errors and proceed. I find that it can proceed to visit the parse tree even when there has been a parse error: a rule will match, but sometimes not all of the rule's elements are present. This causes a problem in my visitor code, because my code expects all elements of the rule match to be there, and it throws an exception.
Two options I'm thinking of:
1) After parsing, check parser.getNumberOfSyntaxErrors() > 0 and don't proceed to visit the parse tree if so. This stops an exception from being thrown, but doesn't give the user as good feedback as possible. antlr4 recovers nicely from errors and can get to the next independent section of what I'm trying to parse, so aborting entirely is more drastic than I'd like.
2) I can wrap each self.visit() call in something that catches an exception and reacts accordingly. I think this would work.
But I'm wondering if there is something in ctx, or elsewhere, that would tell me that what's below it in the parse tree is an incomplete match?
In case it is relevant, I'm using Python with ANTLR 4.
As you have seen, ANTLR4 tries to re-synchronize the input stream to the rule structure once an error is encountered. This is usually done by trying to detect a single missing token or a single additional token. Everything else usually leads to error nodes all the way to the end of the input.
Of course, if the input cannot be parsed successfully, the parse tree will be incomplete, at least from the point of the error, which might be much earlier than where the actual mistake is located. This happens because of variable lookahead, which may consume much of the input to make a prediction early in the parsing process (and hence can fail early).
In fact, I recommend following path 1). Once you get a syntax error there's not much in the parse tree you can use. It's entirely up to the grammar structure which part will be parsed successfully (don't assume it will always be everything up to the error position, as just explained).
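A minimal sketch of option 1) in the Python runtime; MyLexer, MyParser, MyVisitor and the start rule prog are placeholder names standing in for the asker's generated classes:

# Sketch of option 1): skip the visitor entirely if anything failed to parse.
# MyLexer/MyParser/MyVisitor and 'prog' are placeholder names.
from antlr4 import CommonTokenStream, InputStream
from MyLexer import MyLexer
from MyParser import MyParser
from MyVisitor import MyVisitor

def parse_and_visit(text):
    parser = MyParser(CommonTokenStream(MyLexer(InputStream(text))))
    tree = parser.prog()                      # parse the whole input
    if parser.getNumberOfSyntaxErrors() > 0:
        return None                           # report errors instead of visiting
    return MyVisitor().visit(tree)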

Elegant way to parse "line splices" (backslashes followed by a newline) in megaparsec

For a small compiler project we are currently implementing a compiler for a subset of C, for which we decided to use Haskell and megaparsec. Overall we have made good progress, but there are still some corner cases that we cannot handle correctly yet. One of them is the treatment of backslashes followed by a newline. To quote from the specification:
Each instance of a backslash character (\) immediately followed by a
new-line character is deleted, splicing physical source lines to form
logical source lines. Only the last backslash on any physical source
line shall be eligible for being part of such a splice.
(§5.1.1, ISO/IEC 9899:201x)
So far we have come up with two possible approaches to this problem:
1.) Implement a pre-lexing phase in which the initial input is preprocessed and every occurrence of \\\n is removed. The big disadvantage we see in this approach is that we lose accurate error locations, which we need.
2.) Implement a special char' combinator that behaves like char but looks one extra character ahead and silently consumes any \\\n (a sketch of this idea follows below). This would give us correct positions. The disadvantage here is that we would need to replace every occurrence of char with char' in every parser, even in the megaparsec-provided ones like string, integer, whitespace, etc.
Most likely we are not the first people trying to parse a language with such a "quirk" with parsec/megaparsec, so I could imagine there is a nicer way to do it. Does anyone have an idea?
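For reference, approach 2.) could be sketched as a small combinator like the following; this is only one possible shape, assuming String input, and it is not from the original thread:

import Data.Void (Void)
import Text.Megaparsec
import Text.Megaparsec.Char (char)

type Parser = Parsec Void String

-- Like 'char', but silently consumes any backslash-newline splices that
-- follow the matched character, so source positions stay accurate.
char' :: Char -> Parser Char
char' c = char c <* skipMany (chunk "\\\n")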

Why is antlr4 c grammar parser rule "typeSpecifier" not using lexer rule "Double"?

I am using the antlr4 C grammar as inspiration for my own grammar, and I came across one thing I don't really get. Why are there lexer rules for datatypes when they are not used? For example, the rule Double : 'double'; is never used, but the parser rule typeSpecifier : ('double' | ... ); (other datatypes removed to simplify) is used in several places. Is there a reason why the parser rule typeSpecifier is not using the lexer rule Double?
All the grammars on that page are volunteer submissions, not part of ANTLR4 itself. It's clearly a mistake, but given the way lexer rules are matched, it won't make a difference in lexing. You can choose to implement either the explicit rule:
Double : 'double';
or the implicit one:
typeSpecifier
    : ('void'
    | 'char'
    | 'short'
    | 'int'
    | 'long'
    | 'float'
    | 'double')
    ;
with no ill effects either way, even if you mix methods. In fact, if you take a broader look at that whole grammar, the author did the same thing with numerous other lexer rules, Register for example. It makes no difference in actual practice.
Bottom line? Choose whichever method you like and apply it consistently. My personal preference is for brevity, so I like implicit tokens as long as they are used in only one place in the grammar. As soon as a token might be used in two places, I prefer to make an explicit token out of it and update the two or more locations where it's used.
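A hypothetical illustration of that rule of thumb (the second parser rule is invented for the example):

DOUBLE : 'double' ;

typeSpecifier : 'void' | 'char' | DOUBLE ;   // first use
returnType    : DOUBLE | 'float' ;           // second use: the explicit token
                                             // keeps the spelling in one place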

Semantic lexer predicate performance

I have a lexer that creates MACRO tokens for a dynamic list of macro strings passed to the lexer. I used a semantic predicate in the topmost lexer rule to implement this feature:
MACRO: { macros != null && tryMacro() }? .;
Here tryMacro() just checks whether any macro string matches the input sequence.
The performance of this approach was very bad and after some research I tried changing the lexer rule to the following:
MACRO: . { macros != null && tryMacro() }?;
This improved performance dramatically, but I don't really understand why. :) Since the '.' matches any character, the semantic predicate should be invoked exactly as many times as before, shouldn't it? Can someone provide an explanation for this behavior?
The reason is pretty simple: if you put the predicate at the start, the lexer will evaluate it to decide if the MACRO rule should apply. If you put it at the end, it will only perform the check when it has a potential match for the MACRO rule.
Since MACRO is very generic, I suppose you put it at the end of your rules; due to the priority rules it will be tried last. It can only match single-character tokens, so more specific rules will take priority.
If the MACRO rule is superseded by a higher-priority rule, it won't be considered and your predicate won't be invoked.
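For instance, given a hypothetical rule order like this, an input such as foo is claimed by the longer ID match, so MACRO is never a candidate there and its predicate is never evaluated:

ID    : [a-zA-Z]+ ;                           // longest match beats MACRO
MACRO : . { macros != null && tryMacro() }? ; // only a candidate for single
                                              // characters no other rule wants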
I debugged this a bit further, and it turned out that reordering the rule changed the behavior of the lexer, causing macros to not be accepted during parsing. The reason for the perceived increase in performance was that the semantic predicate was only evaluated a couple of times before the lexer dropped the rule while doing its predictions. So the change to the rule was actually invalid, not a performance improvement.
I finally solved the performance issue by moving the macro handling to the parser.

ANTLR4 Parser Rule granularity

Say I have a grammar with a parser rule that matches one of several specific strings.
Is the proper thing to do to give the rule a separate alternative for each specific string, or to keep the parser rule general and decode the string in a visitor subclass?
If the specific strings are meaningful (e.g. a keyword in a DSL), it sounds like you want tokens. Whatever rules you have in the grammar can then reference the tokens you created.
Generally, it's better to have your grammar do as much of the parsing work as possible, rather than overgeneralizing and having to write a bunch of extra code.
See the following: http://www.antlr.org/wiki/display/ANTLR4/Grammar+Structure
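As a hypothetical sketch of the token-based approach for a small DSL (all names invented):

command
    : START_KW ID    # StartCommand
    | STOP_KW  ID    # StopCommand
    ;

START_KW : 'start' ;
STOP_KW  : 'stop' ;
ID       : [a-zA-Z]+ ;
WS       : [ \t\r\n]+ -> skip ;

With labeled alternatives like these, the generated visitor exposes a separate visitStartCommand/visitStopCommand callback per keyword instead of one generic rule that has to be decoded by hand.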
