Is there a parser equivalent of 'fragment' marking in ANTLR4? - antlr4

Is there a way to tell ANTLR4 to inline the parser rule?
It seems reasonable to have such a feature. After reading the book on ANTLR ("The Definitive ANTLR 4 Reference") I haven't found such a possibility, but changes might have been introduced in the 4 years since the book was released, so I guess it is better to ask here.
Consider the following piece of grammar:
file: ( item | class_decl )*;
class_decl: 'class' class_name '{' type_decl* data_decl* code_decl* '}';
type_decl: 'typedef' ('bool'|'int'|'real') type_name;
const_decl: 'const' type_name const_name;
var_decl: 'var' type_name var_name;
...
fragment item: type_decl | data_decl | code_decl;
fragment data_decl: const_decl | var_decl;
fragment code_decl: function_decl | procedure_decl;
fragment class_name: ID;
fragment type_name: ID;
fragment const_name: ID;
fragment var_name: ID;
The rules marked as fragment are there for clarity/documentation and reusability. From a syntax point of view, however, it is really (for example) a var_decl that is the direct element of file or class_decl, and I'd like that reflected in the contexts created by the parser. All the intermediate contexts created for item, data_decl etc. are superfluous, needlessly take up space, and bind the visitor to the organizational structure of the grammar instead of its actual meaning.
To sum up - I'd expect ANTLR to turn the above grammar into the following before generation of a parser:
file: ( type_decl | const_decl | var_decl | function_decl | procedure_decl | class_decl )*;
class_decl: 'class' ID '{' type_decl* ( const_decl | var_decl )* ( function_decl | procedure_decl )* '}';
type_decl: 'typedef' ('bool'|'int'|'real') ID;
const_decl: 'const' ID ID;
var_decl: 'var' ID ID;
...

No, there is no such thing in parser rules. You could raise an issue/RFE in ANTLR's GitHub repo for such a thing: https://github.com/antlr/antlr4/issues
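Since the grammar itself cannot inline parser rules, one possible workaround is to flatten the tree after parsing. The following is a minimal sketch of that idea on a toy tree, not on ANTLR's own context classes: the Node class and the inline_rules helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """Toy stand-in for a parse-tree context: a rule name plus children."""
    rule: str
    children: List["Node"] = field(default_factory=list)


def inline_rules(node: Node, inlined: Set[str]) -> List[Node]:
    """Flatten the subtree under `node`.

    Every node whose rule name is in `inlined` is dropped and replaced
    by its own (already flattened) children.
    """
    flat: List[Node] = []
    for child in node.children:
        flat.extend(inline_rules(child, inlined))
    if node.rule in inlined:
        return flat              # drop the wrapper, keep what it contained
    return [Node(node.rule, flat)]


# file -> item -> data_decl -> var_decl, the shape the question's grammar builds
tree = Node("file", [
    Node("item", [Node("data_decl", [Node("var_decl")])]),
    Node("class_decl"),
])
(flat_tree,) = inline_rules(tree, {"item", "data_decl", "code_decl"})
print([child.rule for child in flat_tree.children])
```

With the wrapper rules removed, var_decl becomes a direct child of file, which is the shape the question asks for. The cost is an extra pass over the tree after parsing.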

You can use rule element labels. They provide similar functionality but are more restricted (applicable only to a single token or rule):
file: ( item | class_decl )*;
class_decl: 'class' class_name=ID '{' type_decl* data_decl* code_decl* '}';
type_decl: 'typedef' ('bool'|'int'|'real') type_name=ID;
const_decl: 'const' type_name=ID const_name=ID;
var_decl: 'var' type_name=ID var_name=ID;
...
item: type_decl | data_decl | code_decl;
data_decl: const_decl | var_decl;
code_decl: function_decl | procedure_decl;

Related

Mixing two languages

I am writing a grammar for a small meta language. That language should include code blocks of another language (e.g., JavaScript, C, or the like). I would like to treat these code blocks as plain strings that are printed out unchanged. My language is based on C/Java syntax and uses { } for code blocks, but I would also like to use { } for the code blocks of the embedded language. Here is some example code:
// my language
modul Abc {
input x: string;
otherLang {
// this is now a code block from the second
// language, which I do not want to analyze
// It might itself contain { } like
if (something) {
abc = "string";
}
}
}
How would I reuse { and } for those different uses without mixing them up with the ones from the embedded language?
An interesting way to do this is to use mode recursion. ANTLR internally maintains a mode stack.
Although a bit verbose, the recursed mode offers the possibility of handling things -- like comments and escaped chars -- that could otherwise throw off the nesting.
One thing to be aware of is that rules with the more attribute concatenate their matched content into the token produced by the first following rule that is not marked more. The following example uses the virtual token OTHER_END to provide semantic clarity and to avoid confusion with the RPAREN token.
tokens {
OTHER_END
}
otherLang : OTHER_BEG OTHER_END+ ; // multiple 'end's dependent on nesting
OTHER_BEG : 'otherLang' LPAREN -> pushMode(Other) ;
LPAREN : LParen ;
RPAREN : RParen ;
WS : [ \t\r\n] -> skip;
mode Other ;
// handle special cases here
O_RPAREN : RParen -> type(OTHER_END), popMode() ;
O_LPAREN : LParen -> more, pushMode(Other) ;
O_STUFF : . -> more ;
fragment LParen : '{' ;
fragment RParen : '}' ;
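To make the mode-stack idea concrete, here is a small sketch that imitates the push/pop behaviour outside ANTLR. It is an analogue written for illustration, not the lexer ANTLR generates; read_other_block is a hypothetical helper, and it deliberately ignores the special cases (comments, string literals, escapes) that the answer says the recursed mode could handle.

```python
def read_other_block(text: str, open_at: int) -> tuple:
    """Return (body, end_index) for the brace block opening at text[open_at]."""
    if text[open_at] != "{":
        raise ValueError("expected '{' at open_at")
    mode_stack = ["Other"]               # pushMode(Other) on the opening brace
    i = open_at + 1
    while i < len(text):
        ch = text[i]
        if ch == "{":
            mode_stack.append("Other")   # nested brace: push the mode again
        elif ch == "}":
            mode_stack.pop()             # popMode()
            if not mode_stack:           # stack empty: back in the default mode
                return text[open_at + 1:i], i + 1
        i += 1
    raise ValueError("unbalanced braces")


src = 'otherLang { if (x) { abc = "s"; } } input y;'
body, end = read_other_block(src, src.index("{"))
print(repr(body))
print(repr(src[end:]))
```

Each nested { pushes one more entry, so the block only ends when the } that matches the first { is reached. Everything in between is returned as one uninterpreted string.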

How to define a token which is all those characters in set A, except those in sub-set B?

In RFC2616 (HTTP/1.1) the definition of a 'token' in section '2.2 Basic Rules' is given as:
token = 1*<any CHAR except CTLs or separators>
From that section, I've got the following fragments, and now I want to define 'TOKEN':
lexer grammar AcceptEncoding;
TOKEN: /* (CHAR excluding (CTRL | SEPARATORS)) */
fragment CHAR: [\u0000-\u007f];
fragment CTRL: [\u0000-\u001f] | '\u007f';
fragment SEPARATORS: [()<>@,;:\\;"/\[\]?={|}] | SP | HT;
fragment SP: ' ';
fragment HT: '\t';
How do I approximate my hypothetical 'excluding' operator for the definition of TOKEN?
There is no set/range math in ANTLR. You can only combine several sets/ranges via the OR operator. A typical rule for a number of disjoint ranges looks like:
fragment LETTER_WHEN_UNQUOTED:
'0'..'9'
| 'A'..'Z'
| '$'
| '_'
| '\u0080'..'\uffff'
;
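Because the set difference has to be worked out by hand, it can help to let a small script do the arithmetic and print the resulting ranges for pasting into the grammar. This is only a sketch: to_ranges is a hypothetical helper, and the separator list is transcribed from RFC 2616 section 2.2 rather than from the grammar in the question.

```python
def to_ranges(codes):
    """Collapse a set of code points into sorted inclusive (lo, hi) ranges."""
    ranges = []
    for c in sorted(codes):
        if ranges and c == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], c)    # extend the current run
        else:
            ranges.append((c, c))              # start a new run
    return ranges


CHAR = set(range(0x00, 0x80))                  # US-ASCII 0-127
CTL = set(range(0x00, 0x20)) | {0x7F}          # control characters and DEL
SEPARATORS = {ord(ch) for ch in '()<>@,;:\\"/[]?={} \t'}

token_chars = CHAR - CTL - SEPARATORS

# Print each run in ANTLR range syntax, one alternative per line.
for lo, hi in to_ranges(token_chars):
    if lo == hi:
        print(f"'\\u{lo:04x}'")
    else:
        print(f"'\\u{lo:04x}'..'\\u{hi:04x}'")
```

The printed alternatives can be joined with | inside a fragment rule, in the same style as LETTER_WHEN_UNQUOTED above.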
One approach is to 'do the math' on the sets of characters by hand, so that we can define lexical rules which only ever combine characters:
lexer grammar RFC2616;
TOKEN: (DIGIT | UPALPHA | LOALPHA | NON_SEPARATORS)+;
/*
* split up ASCII 0-127 into 'atoms' of
* relevance per '2.2 Basic Rules'. Regions
* not requiring to be referenced are not
* given a name.
*/
// [\u0000-\u0008]; /* (control chars) */
fragment HT: '\u0009'; /* (tab) */
fragment LF: '\u000a'; /* (LF) */
// [\u000b-\u000c]; /* (control chars) */
fragment CR: '\u000d'; /* (CR) */
// [\u000e-\u001f]; /* (control chars) */
fragment SP: '\u0020'; /* (space) */
// [\u0021-\u002f]; /* !"#$%&'()*+,-./ */
fragment DIGIT: [\u0030-\u0039]; /* 0123456789 */
// [\u003a-\u0040]; /* :;<=>?@ */
fragment UPALPHA: [\u0041-\u005a]; /* ABCDEFGHIJKLMNOPQRSTUVWXYZ */
// [\u005b-\u0060]; /* [\]^_` */
fragment LOALPHA: [\u0061-\u007a]; /* abcdefghijklmnopqrstuvwxyz */
// [\u007b-\u007e]; /* {|}~ */
// '\u007f'; /* (del) */
/*
* Considering 'all relevant gaps' and the characters we
* cannot use per RFC 2616 Section 2.2 Basic Rules definition
* of 'separators', what does that leave us with?
* (manually determined)
*/
fragment SEPARATORS: [()<>@,;:\\;"/\[\]?={|}];
fragment NON_SEPARATORS: [!#$%&'*+\-.^_`~];
I don't find this approach especially satisfying. Another rule in RFC 2616 wants to be defined like:
TEXT: <any OCTET except CTLs, but including LWS>
qdtext = <any TEXT except <">>
This would force me to further refactor up my expedient 'SEPARATORS' token, above, like:
fragment QUOT: '"';
fragment SEPARATORS_OTHER_THAN_QUOT: [()<>@,;:\\;/\[\]?={|}];
fragment SEPARATORS: SEPARATORS_OTHER_THAN_QUOT | QUOT;
fragment LWS: SP | HT;
TEXT: DIGIT | UPALPHA | LOALPHA | LWS | SEPARATORS | NON_SEPARATORS;
QDTEXT: DIGIT | UPALPHA | LOALPHA | LWS | SEPARATORS_OTHER_THAN_QUOT | NON_SEPARATORS;
Perhaps this is part of the work of writing a lexer, and can't be avoided, but it feels more like solving the problem the wrong way!
(NB: I won't be marking this answer as 'correct'.)
Spurred on by the answer from @mike-lischke (because LETTER_WHEN_UNQUOTED really felt wrong still), I hunted for the surely-common treatment of quoted string literals in other grammars. In Terence Parr's own Java 1.6 ANTLR3 grammar (er, not properly served as text/plain) (via ANTLR3 Grammar List), he reaches for a 'match any character other than' tilde-operator ~ in a lexer rule:
STRINGLITERAL
: '"'
( EscapeSequence
| ~( '\\' | '"' | '\r' | '\n' )
)*
'"'
;
// Copyright (c) 2007-2008 Terence Parr and possibly Yang Jiang.
NOTE: the above code is licenced under a BSD licence, but I am not re-distributing this fragment under the BSD license (since this post itself is under CC-BY-SA). Instead, I am using it within the terms of 'fair use' as I understand them.
So the ~ gives me an option to express: 'all those characters in Unicode, except those in Set B'. "Annoying I don't get to choose the set which is excluded from", I thought. But then I realised
TOOHIGH: [\u007f-\uffff];
TOKEN: (~( TOOHIGH | SP | HT | CTRL | SEPARATORS ))+
... should be fine. Although, in practice, ANTLR4 doesn't 'like' lexer sub-rules appearing in 'sets', and only handles sets of literals, so that ultimately becomes:
TOKEN:
/* this is given in '2.2 Basic Rules' as:
*
* token = 1*<any CHAR except CTLs or separators>
*
* which I am reducing down to:
* any character in ASCII 0-127 but _excluding_
* CTRL (0-31,127)
* SEPARATORS
* space (32)
* and tab (9) (which is a CTRL character anyhow)
*/
( ~( [\u0000-\u001f] | '\u007f' /*CTRL,HT*/ | [()<>@,;:\\;"/\[\]?={|}] /*SEPARATORS*/ | '\u0020' /*SP*/ | [\u0080-\uffff] /*NON_ASCII*/))+
;
The trick was to express the set I do want (Unicode 0-127) by also excluding the set I don't want (Unicode 128+).
This is much more succinct than my other answer. If it actually works, I'll mark it as correct.
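The same complement trick can be tried out quickly with a negated character class in a regular expression before committing it to the grammar. The sketch below is a Python analogue, not ANTLR code; the EXCLUDED and TOKEN names are made up for the example, and the separator list follows RFC 2616 section 2.2.

```python
import re

# Everything that must NOT appear in a token: control characters, DEL,
# the RFC 2616 separators, space, and every code point from U+0080 to U+FFFF.
EXCLUDED = r'\x00-\x1f\x7f()<>@,;:\\"/\[\]?={}\x20\x80-\uffff'

# One or more characters outside the excluded set, mirroring the ~( ... )+ rule.
TOKEN = re.compile("[^" + EXCLUDED + "]+")

print(TOKEN.findall("gzip;q=0.5, x-foo"))
```

Excluding the high range inside the negated class is what keeps the match within ASCII, which is the point made above about expressing the wanted set through what is left out.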

Errors when compiling GLR parsers from Happy - 'parse error on input ‘case’'

I have tried multiple example grammars and get the same error when I try to compile the generated files.
For example I have followed exactly the solution to this question - GLR_Lib.hs: Could not find module 'System'
where the grammar file is
%tokentype { ABC }
%error { parseError }
%token
a { A }
b { B }
c { C }
%%
s1 : a a a b {} | b s2 a {}
s2 : b a b s2 {} | c {}
{
data ABC = A | B | C deriving (Eq,Ord,Show)
parseError _ = error "bad"
}
But when I compile I get:
[1 of 2] Compiling ABCData ( ABCData.hs, ABCData.o )
[2 of 2] Compiling ABC ( ABC.hs, ABC.o )
GLR_Lib.hs:164:2: parse error on input ‘case’
This exact error has happened with every grammar that I have tried. I don't know what I could be doing differently to people that have the examples working successfully.
There are indentation errors in the GLR_Lib template. This is what I did to get it to work:
1. Create the ABCMain.hs file.
2. Create a new directory ./templates for the edited templates.
3. Find the originals, e.g. with locate GLR_Lib. On OSX with the Haskell Platform I found them in /Library/Haskell/current/share/happy-1.19.4/
4. Copy all of the templates to ./templates
5. Make the following edits to ./templates/GLR_Lib:
   - line 44: comment out import System
   - line 161: replace the leading space with a tab: case new_stks of
   - line 190: replace the leading space with a tab: stks' <- foldM (pack i) stks reds
6. Run: happy --glr --template=./templates ABC.y
7. Compile with: ghc --make ABCMain
You will probably only need the GLR_Lib and GLR_Base templates.
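If the line numbers differ in another Happy version, a short script can help locate candidate lines. This is only a sketch under the assumption that the offending lines start with a space followed by a tab, as the edits above suggest; space_before_tab_lines is a hypothetical helper and the sample text is invented.

```python
def space_before_tab_lines(text):
    """Return 1-based numbers of lines whose indentation starts with a
    space and also contains a tab."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        indent = line[:len(line) - len(stripped)]
        if indent.startswith(" ") and "\t" in indent:
            hits.append(number)
    return hits


# Invented sample: only the third line mixes a leading space with a tab.
sample = "f x =\n\tdo a\n \tcase b of\n\t  c -> d\n"
print(space_before_tab_lines(sample))
```

To check a real template, read the file's contents and pass them to the function, then inspect the reported lines by hand before editing.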

How to check if ID is valid during parsing - ParseTreeListener enter event not called

My grammar contains the following:
assignment
: ID ASSIGN expr
;
expr
: MINUS expr #unaryMinusExpr
| NOT expr #notExpr
| expr MULT expr #multExpr
| expr DIV expr #divExpr
| expr PLUS expr #plusExpr
| expr MINUS expr #minusExpr
| expr LTEQ expr #lteqExpr
| expr GTEQ expr #gteqExpr
| expr LT expr #ltExpr
| expr GT expr #gtExpr
| expr NEQ expr #neqExpr
| expr EQ expr #eqExpr
| expr AND expr #andExpr
| expr OR expr #orExpr
| atom #atomExpr
;
atom
: OPAR expr CPAR #parExpr
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| STRING #stringAtom
| ID #idAtom
;
ID
: [a-zA-Z_] [a-zA-Z_0-9]*
;
The ID here represents an entry (row) in a database which the user refers to by, well :), the ID. So when parsing the formula, I'd like to check that they entered a valid ID.
From what I can tell, the way to go is to have a ParseTreeListener that overrides EnterIdAtom so I can throw a RecognitionException. So I hooked that up, but the Enter event is never called.
class MyListener : BaseListener
{
public override void EnterIdAtom(IdAtomContext context)
{
// Reject identifiers that do not correspond to a database row.
if (!CheckForValidId(context.ID().GetText()))
{
throw new RecognitionException(...);
}
}
}
Not sure why?
Is there a better way of doing this?
Thanks.
It sounds like you are using Parser.addParseListener (bold font added by me):
Registers listener to receive events during the parsing process.
To support output-preserving grammar transformations (including but not limited to left-recursion removal, automated left-factoring, and optimized code generation), calls to listener methods during the parse may differ substantially from calls made by ParseTreeWalker.DEFAULT used after the parse is complete. In particular, rule entry and exit events may occur in a different order during the parse than after the parser. In addition, calls to certain rule entry methods may be omitted.
With the following specific exceptions, calls to listener events are deterministic, i.e. for identical input the calls to listener methods will be the same.
Alterations to the grammar used to generate code may change the behavior of the listener calls.
Alterations to the command line options passed to ANTLR 4 when generating the parser may change the behavior of the listener calls.
Changing the version of the ANTLR Tool used to generate the parser may change the behavior of the listener calls.
If this is the case, you should be using ParseTreeWalker to walk the tree after parsing is complete, instead of trying to mix the two operations together.
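The parse-first, walk-afterwards pattern can be sketched on a toy tree. This is not the ANTLR runtime API: Node, walk, and IdChecker are hypothetical names, and the tree is built by hand where a real program would get it from the parser.

```python
class Node:
    """Toy parse-tree node: a rule (or alternative label) name, text, children."""
    def __init__(self, rule, text="", children=()):
        self.rule = rule
        self.text = text
        self.children = list(children)


def walk(listener, node):
    """Depth-first walk firing enter_<rule>/exit_<rule> when the listener has them."""
    getattr(listener, "enter_" + node.rule, lambda n: None)(node)
    for child in node.children:
        walk(listener, child)
    getattr(listener, "exit_" + node.rule, lambda n: None)(node)


class IdChecker:
    """Collects identifiers that are not in the set of known IDs."""
    def __init__(self, known_ids):
        self.known_ids = set(known_ids)
        self.unknown = []

    def enter_idAtom(self, node):
        if node.text not in self.known_ids:
            self.unknown.append(node.text)


# Hand-built tree for something like: x = row1 + rowX
tree = Node("assignment", children=[
    Node("plusExpr", children=[Node("idAtom", "row1"), Node("idAtom", "rowX")]),
])
checker = IdChecker({"row1", "row2"})
walk(checker, tree)
print(checker.unknown)
```

Because the walk runs on a finished tree, every idAtom node is visited exactly once and in a predictable order. Collecting the unknown IDs in a list also allows all of them to be reported together rather than stopping at the first one.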
Have you set a breakpoint at:
public override void enterEveryRule(ParserRuleContext ctx)
or
public override void enterAtom(AtomContext ctx)
This should give you a clue if the rule is even invoked.
Otherwise try this:
atom
: OPAR expr CPAR #parExpr
| (INT | FLOAT) #numberAtom
| (TRUE | FALSE) #booleanAtom
| STRING #stringAtom
| atomId #idAtom // changed! (ANTLR 4 requires all alternatives to be labeled, or none)
;
atomId
: ID
;
This will generate a new "real" context, which you can visit:
public override void enterAtomId(AtomIdContext ctx) {}

Antlr4 left-recursive rule appears to produce right-associative parse

The following grammar illustrates the issue:
// test Antlr4 left recursion associativity
grammar LRA;
@parser::members {
public static void main(String[] ignored) throws Exception{
final LRALexer lexer = new LRALexer(new ANTLRInputStream(System.in));
final LRAParser parser = new LRAParser(new CommonTokenStream(lexer));
parser.setTrace(true);
parser.file();
}
}
ID: [A-Za-z_] ([A-Za-z_]|[0-9])*;
CMA: ',';
SMC: ';';
UNK: . -> skip;
file: punctuated EOF;
punctuated
: punctuated cma punctuated
| punctuated smc punctuated
| expression
;
cma: CMA;
smc: SMC;
expression: id;
id: ID;
Given input "a,b,c" I get listener event trace output
( 'a' ) ( ',' ( 'b' ) ( ',' ( 'c' ) ) )
where ( represents enter punctuated, ) represents exit punctuated, and all other rules are omitted for brevity and clarity.
By inspection, this order of listener events represents a right-associative parse.
Common practice, and The Definitive Antlr 4 Reference, lead me to expect a left-associative parse, corresponding to the following listener event trace
( 'a' ) ( ',' ( 'b' ) ) ( ',' ( 'c' ) )
Is there something wrong with my grammar, my expectations, my interpretation of the listener events, or something else?
I would consider the workaround described above to be an adequate answer. The generated parser needs to pass a precedence parameter to a recursive call, and since the precedence is associated with a token, the token has to be directly available in the recursive rule so Antlr can find its precedence.
The working grammar looks like this:
// test Antlr4 left recursion associativity
grammar LRA;
@parser::members {
public static void main(String[] ignored) throws Exception{
final LRALexer lexer = new LRALexer(new ANTLRInputStream(System.in));
final LRAParser parser = new LRAParser(new CommonTokenStream(lexer));
parser.setTrace(true);
parser.file();
}
}
ID: [A-Za-z_] ([A-Za-z_]|[0-9])*;
CMA: ',';
SMC: ';';
UNK: . -> skip;
file: punctuated EOF;
punctuated
: punctuated CMA punctuated
| punctuated SMC punctuated
| expression
;
expression: id;
id: ID;
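The role of the precedence parameter mentioned in the answer can be illustrated with a generic precedence-climbing parser. This is a sketch of the general technique, not the code ANTLR generates; the PRECEDENCE table, the parse function, and the left_assoc switch are assumptions made for the example.

```python
# Assumed precedences: the earlier alternative (comma) binds tighter.
PRECEDENCE = {",": 2, ";": 1}


def parse(tokens, pos=0, min_prec=0, left_assoc=True):
    """Parse tokens[pos:] into nested tuples and return (tree, next_pos)."""
    left, pos = tokens[pos], pos + 1           # primary: a single identifier
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], -1) >= min_prec:
        op = tokens[pos]
        # Passing a raised precedence stops the recursive call from swallowing
        # the next operator of the same level, which gives left associativity.
        # Passing 0 lets the recursive call take everything to its right.
        next_min = PRECEDENCE[op] + 1 if left_assoc else 0
        right, pos = parse(tokens, pos + 1, next_min, left_assoc)
        left = (left, op, right)
    return left, pos


tokens = ["a", ",", "b", ",", "c"]
print(parse(tokens)[0])
print(parse(tokens, left_assoc=False)[0])
```

The first call groups a and b before c, and the second groups b and c first. The only difference between them is the precedence value handed to the recursive call.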
