Grammar:
grammar Test;
file: (procDef | statement)* EOF;
procDef: 'procedure' ID NL statement+ ;
statement: 'statement'? NL;
WS: (' ' | '\t') -> skip;
NL: ('\r\n' | '\r' | '\n');
ID: [a-zA-Z0-9]+;
Test data:
statement
procedure Proc1
statement
statement
The parser does what I want (i.e. statement+ is greedy), but it reports an ambiguity because it doesn't know whether the last statement belongs to procDef or file (as I understand it).
As predicates are language dependent I'd prefer not to use one.
The procedure is supposed to end when a statement that can't belong to it, such as 'procedure', occurs.
I also would prefer to have the statements bound to the procedure to avoid having to rearrange the structure later.
Edit
It seems I should expand my test data a bit (but I will leave the original as it is small and shows the ambiguity I want to solve).
I want to be able to handle situations like this:
statement
procedure Proc1
statement
statement
procedure Proc2
statement
statement
procedure Proc2a
statement
statement
global
statement
procedure Proc3
statement
statement
(The indentation is not significant.) I can do it without predicates with something like
file: (
commonStatement
| globalStatement
)* EOF;
procDef: 'procedure' ID NL commonStatement+ ;
commonStatement: 'statement'? NL;
globalStatement: 'global' NL | procDef (globalStatement | EOF);
but then the tree becomes deeper with each consecutive procDef, and that feels very undesirable.
In that case a solution with predicates is actually preferable.
@parser::members { boolean inProc; }
file: (
{!inProc}? commonStatement
| globalStatement
)* EOF;
procDef: 'procedure' ID {inProc = true;} NL commonStatement+ ;
commonStatement: 'statement'? NL;
globalStatement: ('global' NL {inProc = false;} | procDef) ;
The situation is actually worse than this: globally accessible commonStatements can occur without an intervening globalStatement (they are reachable through gotos), but there is no way a parser can distinguish between those and statements belonging to the procedure, so my plan was to just discourage such use (and I don't think it's common). In fact, it is perfectly legal to jump into procedure code as well ...
It may turn out that in the end I will have to examine runtime paths anyway (scope is very much determined at runtime), and the grammar might end up something like
file: (
commonStatement
| globalStatement
| procDef
)* EOF;
procDef: 'procedure' ID NL procStatement*;
commonStatement: 'statement'? NL;
procStatement: 'proc' NL;
globalStatement: 'global' NL;
We will see ...
By your criteria, it is impossible for a statement to follow a procDef. You are well within your rights to design a language that way, but I hope you have an answer ready for the FAQ "How do I write a statement which comes after a procedure definition?"
Writing the grammar is the easy part:
file: statement* procDef* EOF;
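Dropped into the original grammar, that file rule gives a sketch like the following (untested; it resolves the reported ambiguity precisely because no statement may follow a procDef at file level, so trailing statements must bind to the last procDef):

```antlr
grammar Test;

file      : statement* procDef* EOF ;
procDef   : 'procedure' ID NL statement+ ;
statement : 'statement'? NL ;

WS : (' ' | '\t') -> skip ;
NL : '\r\n' | '\r' | '\n' ;
ID : [a-zA-Z0-9]+ ;
```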
Related
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of variable names that allows building them from letters, digits, underscores and dots (see the definition below).
The problem arises when I use 'false' as a variable name, e.g. varName.nestedVar.false: the rule which marks false as a falseLiteral takes precedence.
I tried to play with the whitespace handling, using everything I found on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I tried to tell the boolLiteral rule that it has to have some whitespace before the actual literal, like WHITESPACE* trueLiteral, but this doesn't work when the whitespace is sent to the HIDDEN channel, and disabling that altogether again means a lot of code rewriting (since I often rely on the order of tokens). I also tried to reorder the alternatives in the literal rule, but this had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition to a lexer rule:
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct -> varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a '.' returns:
line 1:8 no viable alternative at input 'varName AND'
The expected solution would be either to enable/disable the whitespace -> channel(HIDDEN) action for specific rules only, to tell the boolLiteral rule that it cannot start with a dot (something like ~POINT falseLiteral; I tried this as well, with no luck), or to get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
In a previous question with a simple grammar, I learned to handle IDs that can include keywords from a keyword list. My actual grammar is a little more complex: there are several lists of keywords that are expected in different types of sentences. Here's my attempt at a simple grammar that tells the story:
grammar Hello;
file : ( fixedSentence )* EOF ;
sentence : KEYWORD1 ID+ KEYWORD2 ID+ PERIOD
| KEYWORD3 ID+ KEYWORD4 ID+ PERIOD;
KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
PERIOD : '.' ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
So the sentences I want to match are, for example:
hello snickers bar goodbye mars bar.
dear peter this is fun yours james.
and that works great. But I also want to match sentences that contain keywords that would not be expected to terminate the ID+ block. For example
hello hello kitty goodbye my dearest raggedy ann and andy.
hello first appears as KEYWORD1, and then again immediately after, where it should instead be part of that first ID+ block. Following the example of the above linked question, I can fix it like this:
// ugly solution:
fixedSentence : KEYWORD1 a=(ID|KEYWORD1|KEYWORD3|KEYWORD4)+ KEYWORD2 b=(ID|KEYWORD1|KEYWORD2|KEYWORD3|KEYWORD4)+ PERIOD
| KEYWORD3 a=(ID|KEYWORD1|KEYWORD2|KEYWORD3)+ KEYWORD4 b=(ID|KEYWORD1|KEYWORD2|KEYWORD3|KEYWORD4)+ PERIOD;
which works and does exactly what I'd like. In my real language, I've got hundreds of keyword lists, to be used in different types of sentences, so with this approach I'll certainly make a mistake somewhere, and whenever I create a new structure in my language I'll have to go back and edit all the others.
What would be nice is to do non-greedy matching from a list, following the ANTLR4 book's examples for comments. So I tried this
// non-greedy matching concept:
KEYWORD : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
niceID : ( ID | KEYWORD ) ;
niceSentence : KEYWORD1 niceID+? KEYWORD2 niceID+? PERIOD
| KEYWORD2 niceID+? KEYWORD3 niceID+? PERIOD;
which I think follows the model for comments (e.g. given on p.81 of the book):
COMMENT : '/*' .*? '*/' -> skip ;
by using the ? to suggest non-greediness. (Though that example is a lexer rule; does that change the meaning here?) fixedSentence works, but niceSentence is a failure. Where do I go from here?
To be specific, the errors reported in parsing the hello kitty test sentence above are,
Testing rule sentence:
line 1:6 extraneous input 'hello' expecting ID
line 1:29 extraneous input 'dearest' expecting {'.', ID}
Testing rule fixedSentence: no errors.
Testing rule niceSentence:
line 1:6 extraneous input 'hello' expecting {ID, KEYWORD}
line 1:29 extraneous input 'dearest' expecting {KEYWORD2, ID, KEYWORD}
line 1:57 extraneous input '.' expecting {KEYWORD2, ID, KEYWORD}
And if it helps to see the parse trees, here they are.
Recognize that the parser is ideally suited to handling syntax, i.e., structure, and not semantic distinctions. Whether a keyword is an ID terminator in one context and not in another, both being syntactically equivalent, is inherently semantic.
The typical ANTLR approach to handling semantic ambiguities is to create a parse tree recognizing as many structural distinctions as reasonably possible, and then walk the tree analyzing each node in relation to the surrounding nodes (in this case) to resolve ambiguities.
If this resolves to your parser being
sentences : ( ID+ PERIOD )* EOF ;
then your sentences are essentially free form. The more appropriate tool might be an NLP library - Stanford has a nice one.
Additional
If you define your lexer rules as
KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
. . . .
KEYWORD : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
the lexer will never emit a KEYWORD token - 'hello' is consumed and emitted as a KEYWORD1, and the KEYWORD rule is never evaluated. Since the parse tree (apparently) fails to identify the type of the tokens, it is not very illuminating. Dump the token stream to see what the lexer is actually doing:
hello hello kitty goodbye my dearest ...
KEYWORD1 KEYWORD1 ID KEYWORD2 ID KEYWORD3 ...
If you place the KEYWORD rule before the others, then the lexer is going to only emit KEYWORD tokens.
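The two lexer behaviors described above (longest match wins; ties go to the rule listed first) can be sketched in a toy Python tokenizer. This is a simulation of the disambiguation policy for the Hello grammar, not ANTLR itself:

```python
import re

# Toy model of ANTLR lexer disambiguation for the Hello grammar above:
# at each position, the longest match wins, and ties go to the rule
# listed first. Alternatives inside each pattern are ordered longest
# first because Python's '|' is leftmost-first, unlike an ANTLR lexer
# rule, which always takes the longest alternative.
RULES = [
    ("KEYWORD1", r"hello|howdy|hi"),
    ("KEYWORD2", r"goodbye|bye|adios"),
    ("KEYWORD3", r"dearest|dear"),
    ("KEYWORD4", r"love|yours"),
    ("KEYWORD",  r"hello|howdy|hi|goodbye|bye|adios|dearest|dear|love|yours"),
    ("PERIOD",   r"\."),
    ("ID",       r"[a-z]+"),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():              # WS -> skip
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:          # earlier rule wins a tie
            m = re.match(pattern, text[pos:])
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError("no rule matches at position %d" % pos)
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

Run on the test sentence, it reproduces the dump above; the KEYWORD rule never fires, because each keyword is already claimed by one of KEYWORD1..KEYWORD4, which are listed first and match the same length of text.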
Changing to parser rules
niceID : ( ID | keyword ) ;
keyword : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
will allow this very limited example to work.
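Putting the answer's pieces together with the question's lexer rules, the working variant might look like this (an untested sketch; the second alternative follows the fixedSentence pattern from the question rather than the niceSentence one):

```antlr
grammar Hello;

file : niceSentence* EOF ;

niceSentence
    : KEYWORD1 niceID+? KEYWORD2 niceID+? PERIOD
    | KEYWORD3 niceID+? KEYWORD4 niceID+? PERIOD
    ;

niceID  : ID | keyword ;
keyword : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;

KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
PERIOD   : '.' ;
ID       : [a-z]+ ;
WS       : [ \t\r\n]+ -> skip ;
```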
I am parsing an SQL-like language.
I want to parse a full SQL sentence like SELECT .. FROM .. WHERE, and also a simple expr line, which can be a function, a where clause, an expr, or arithmetic.
this is the important parts of the grammar:
// main rule
parse : (statments)* EOF;
// All optional statements
statments : select_statement
| virtual_column_statement
;
select_statement :
SELECT select_item ( ',' select_item )*
FROM from_cluase ( ',' from_cluase )*
(WHERE where_clause )?
( GROUP BY group_by_item (',' group_by_item)* )?
( HAVING having_condition (AND having_condition)* )?
( ORDER BY order_clause (',' order_clause)* )?
( LIMIT limit_clause)?
| '(' select_statement ')'
virtual_column_statement:
virtual_column_expression
;
virtual_column_expression :
expr
| where_clause
| function
| virtual_column_expression arithmetichOp=('*'|'/'|'+'|'-'|'%') virtual_column_expression
| '(' virtual_column_expression ')'
;
virtual_columns work great.
Select queries work too, but after a query finishes, the parser goes on to match a virtual_column_statement as well.
I want it to choose one.
How can I fix this?
EDIT:
After some research I found out ANTLR takes my query and separates it into two different parts.
How can I fix this?
Thanks,
id
Your 'virtual_column_statement' appears to be part of the 'select_statement'. I expect that you are missing a ';' between the two rules.
Most of your 'select_statement' clauses are optional, so after matching the select and from clauses, if Antlr thinks that the balance of the input is better matched as a 'virtual_column_statement', then it will take that path.
Your choices are:
1) make your select_statement comprehensive and at least as general as your 'virtual_column_statement';
2) require a keyword at the beginning of the 'virtual_column_statement' to prevent Antlr from considering it as a partial alternate;
3) put the 'virtual_column_statement' in a separate parser grammar and don't send it any select input text.
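Option 2 might look like this sketch, where 'compute' is a made-up keyword chosen only to illustrate the idea:

```antlr
virtual_column_statement
    : 'compute' virtual_column_expression
    ;
```

With a distinct leading token, the parser can commit to one alternative of statments as soon as it sees SELECT or 'compute', instead of reconsidering the remainder of a select as a virtual column.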
I am writing an ANTLR 4 grammar for a language that will have switch statements that do not allow fallthrough (similar to C#). All case statements must be terminated by a break statement. Multiple case statements can follow each other without any code in between (again, just like in C#). Here is a snippet of the grammar that captures this:
grammar MyGrammar;
switchStmt : 'switch' '(' expression ')' '{' caseStmt+ '}' ;
caseStmt : (caseOpener)+ statementList breakStmt ;
caseOpener : 'case' literal ':'
| 'default' ':'
;
statementList : statement (statement)* ;
breakStmt : 'break' ';' ;
I left out the definitions of expression and statement for brevity. However, it's important to note that the definition for statement includes breakStmt. This is because break statements can also be used to break out of loops.
In general the grammar is fine - it parses input as expected. However, I get warnings during the parse like "line 18:0 reportAttemptingFullContext d=10 (statementList), input='break;'" and "line 18:0 reportContextSensitivity d=10 (statementList), input='break;'". This makes sense because the parser is not sure whether to match a break statement as a statement or as a breakStmt and needs to fall back on ALL(*) parsing. My question is: how can I change my grammar to eliminate the need for this during the parse and avoid the performance hit? Is it even possible without changing the language syntax?
You should remove the breakStmt reference from the end of caseStmt, and instead perform this validation in a listener or visitor after the parse is complete. This offers you the following advantages:
Improved error handling when a user omits the required break statement.
Improved parser performance by removing the ambiguity between the breakStmt at the end of caseStmt and the statementList that precedes it.
I would use the following rules:
switchStmt
: 'switch' '(' expression ')' '{' caseStmt* '}'
;
caseStmt
: caseOpener statementList?
;
statementList
: statement+
;
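The post-parse validation the answer recommends could be sketched like this. In a real project it would live in a listener method (e.g. exitCaseStmt on the generated base listener, a name that depends on your grammar); here a plain Python function stands in for it, operating on the kinds of the direct child statements of one caseStmt:

```python
def validate_case(statement_kinds):
    """Check the no-fallthrough rule for one caseStmt's statementList:
    the list must be non-empty, end with a 'break', and contain no
    earlier direct-child 'break' (which would leave unreachable code).
    Breaks nested inside loops are not in this list, so they are not
    affected. Returns (ok, message)."""
    if not statement_kinds or statement_kinds[-1] != "break":
        return False, "case must be terminated by a break statement"
    if "break" in statement_kinds[:-1]:
        return False, "unreachable code after break"
    return True, "ok"
```

Because the check runs after the parse, the message can be reported as a targeted semantic error with the offending case's position, rather than as a generic syntax error from the parser.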
How do I build a token in the lexer that can handle recursion inside, as in this string?
${*anything*${*anything*}*anything*}
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
(screenshot: http://img185.imageshack.us/img185/5471/recq.png)
ANTLR's lexers do support recursion, as @BartK adeptly points out in his post, but you will only see a single token in the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it in the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.
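The parser-side approach can be illustrated with a small hand-written recursive-descent parser. This is a Python sketch of the idea, independent of ANTLR, and it omits the escape sequences handled by the lexer answer above:

```python
def parse_dollar(text, pos=0):
    """Parse one ${ ... } block starting at pos, recursing on nested
    ${ ... } blocks. Returns (node, next_pos); a node is a list whose
    items are plain strings or nested nodes."""
    if not text.startswith("${", pos):
        raise ValueError("expected '${' at position %d" % pos)
    pos += 2
    parts, buf = [], ""
    while pos < len(text):
        if text.startswith("${", pos):       # nested variable: recurse
            if buf:
                parts.append(buf)
                buf = ""
            node, pos = parse_dollar(text, pos)
            parts.append(node)
        elif text[pos] == "}":               # end of this block
            if buf:
                parts.append(buf)
            return parts, pos + 1
        else:                                # ordinary character
            buf += text[pos]
            pos += 1
    raise ValueError("unterminated '${'")
```

Unlike the single-token lexer solution, this exposes every nested piece, so the result can be fed directly into an AST or evaluated in place.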