In a previous question with a simple grammar, I learned to handle IDs that can include keywords from a keyword list. My actual grammar is a little more complex: there are several lists of keywords that are expected in different types of sentences. Here's my attempt at a simple grammar that tells the story:
grammar Hello;
file : ( sentence )* EOF ;
sentence : KEYWORD1 ID+ KEYWORD2 ID+ PERIOD
| KEYWORD3 ID+ KEYWORD4 ID+ PERIOD;
KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
PERIOD : '.' ;
ID : [a-z]+ ;
WS : [ \t\r\n]+ -> skip ;
So the sentences I want to match are, for example:
hello snickers bar goodbye mars bar.
dear peter this is fun yours james.
and that works great. But I also want to match sentences that contain keywords that would not be expected to terminate the ID+ block. For example
hello hello kitty goodbye my dearest raggedy ann and andy.
hello first appears as KEYWORD1 and then, just after, as part of that first ID+ block. Following the example of the question linked above, I can fix it like this:
// ugly solution:
fixedSentence : KEYWORD1 a=(ID|KEYWORD1|KEYWORD3|KEYWORD4)+ KEYWORD2 b=(ID|KEYWORD1|KEYWORD2|KEYWORD3|KEYWORD4)+ PERIOD
| KEYWORD3 a=(ID|KEYWORD1|KEYWORD2|KEYWORD3)+ KEYWORD4 b=(ID|KEYWORD1|KEYWORD2|KEYWORD3|KEYWORD4)+ PERIOD;
which works and does exactly what I'd like. But in my real language I've got hundreds of keyword lists, used in different types of sentences, so with this approach I'm certain to make a mistake somewhere, and whenever I create new structures in the language I have to go back and edit all the other rules.
What would be nice is to do non-greedy matching from a list, following the ANTLR4 book's examples for comments. So I tried this
// non-greedy matching concept:
KEYWORD : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
niceID : ( ID | KEYWORD ) ;
niceSentence : KEYWORD1 niceID+? KEYWORD2 niceID+? PERIOD
| KEYWORD3 niceID+? KEYWORD4 niceID+? PERIOD;
which I think follows the model for comments (e.g. given on p.81 of the book):
COMMENT : '/*' .*? '*/' -> skip ;
by using the ? to suggest non-greediness. (Though that example is a lexer rule; does that change the meaning here?) fixedSentence works, but niceSentence is a failure. Where do I go from here?
To be specific, the errors reported in parsing the hello kitty test sentence above are,
Testing rule sentence:
line 1:6 extraneous input 'hello' expecting ID
line 1:29 extraneous input 'dearest' expecting {'.', ID}
Testing rule fixedSentence: no errors.
Testing rule niceSentence:
line 1:6 extraneous input 'hello' expecting {ID, KEYWORD}
line 1:29 extraneous input 'dearest' expecting {KEYWORD2, ID, KEYWORD}
line 1:57 extraneous input '.' expecting {KEYWORD2, ID, KEYWORD}
And if it helps to see the parse trees, here they are.
Recognize that the parser is ideally suited to handling syntax, i.e., structure, and not semantic distinctions. Whether a keyword is an ID terminator in one context and not in another, both being syntactically equivalent, is inherently semantic.
The typical ANTLR approach to handling semantic ambiguities is to create a parse tree recognizing as many structural distinctions as reasonably possible, and then walk the tree analyzing each node in relation to the surrounding nodes (in this case) to resolve ambiguities.
If this resolves to your parser being
sentences : ( ID+ PERIOD )* EOF ;
then your sentences are essentially free form. The more appropriate tool might be an NLP library - Stanford has a nice one.
Additional
If you define your lexer rules as
KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
. . . .
KEYWORD : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
the lexer will never emit a KEYWORD token: 'hello' is consumed and emitted as a KEYWORD1, and the KEYWORD rule is never evaluated. Since the parse tree (apparently) fails to identify the types of the tokens, it is not very illuminating. Dump the token stream to see what the lexer is actually doing:
hello hello kitty goodbye my dearest ...
KEYWORD1 KEYWORD1 ID KEYWORD2 ID KEYWORD3 ...
If you place the KEYWORD rule before the others, then the lexer is going to only emit KEYWORD tokens.
Changing to parser rules
niceID : ( ID | keyword ) ;
keyword : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
will allow this very limited example to work.
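Putting the pieces together, a sketch of the full grammar under that change (rule names follow the question; only checked against the small examples above):

```antlr
grammar Hello;
file     : sentence* EOF ;
sentence : KEYWORD1 niceID+? KEYWORD2 niceID+? PERIOD
         | KEYWORD3 niceID+? KEYWORD4 niceID+? PERIOD ;
niceID   : ID | keyword ;
keyword  : KEYWORD1 | KEYWORD2 | KEYWORD3 | KEYWORD4 ;
KEYWORD1 : 'hello' | 'howdy' | 'hi' ;
KEYWORD2 : 'bye' | 'goodbye' | 'adios' ;
KEYWORD3 : 'dear' | 'dearest' ;
KEYWORD4 : 'love' | 'yours' ;
PERIOD   : '.' ;
ID       : [a-z]+ ;
WS       : [ \t\r\n]+ -> skip ;
```

Note that keyword must be a parser rule (lowercase): as a lexer rule it would never be matched, for the reason given above.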
Related
I am creating an interpreter in Java using ANTLR. I have a grammar which I have been using for a long time and I have built a lot of code around classes generated from this grammar.
In the grammar, 'false' is defined as a literal, and there is also a definition of variable names that allows building names from letters, digits, underscores, and dots (see the definition below).
The problem arises when I use 'false' as part of a variable name, e.g. varName.nestedVar.false: the rule which marks false as a falseLiteral takes precedence.
I tried to play with the whitespace handling, using everything I found on the internet. Removing WHITESPACE : [ \t\r\n] -> channel (HIDDEN); and using explicit WS* or WS+ in every rule would work for the parser, but I would have to adjust a lot of code in the AST visitors. I tried to tell the boolLiteral rule that it has to have some whitespace before the actual literal, like WHITESPACE* trueLiteral, but this doesn't work when the whitespace is sent to the HIDDEN channel; disabling the hidden channel altogether would again mean a lot of code rewriting, since I often rely on the order of tokens. I also tried to reorder the alternatives in the literal rule, but that had no effect whatsoever.
...
literal:
boolLiteral
| doubleLiteral
| longLiteral
| stringLiteral
| nullLiteral
| varExpression
;
boolLiteral:
trueLiteral | falseLiteral
;
trueLiteral:
TRUE
;
falseLiteral:
FALSE
;
varExpression:
name=qualifiedName ...
;
...
qualifiedName:
ID ('.' (ID | INT))*
...
TRUE : [Tt] [Rr] [Uu] [Ee];
FALSE : [Ff] [Aa] [Ll] [Ss] [Ee];
ID : (LETTER | '_') (LETTER | DIGIT | '_')* ;
INT : DIGIT+ ;
POINT : '.' ;
...
WHITESPACE : [ \t\r\n] -> channel (HIDDEN);
My best bet was to move the qualifiedName definition to a lexer rule
qualifiedName:
QUAL_NAME
;
QUAL_NAME: ID ('.' (ID | INT))* ;
Then it works for
varName.false AND false
varName.whatever.ntimes AND false
The result is correct -> varExpression -> qualifiedName on the left-hand side and boolLiteral -> falseLiteral on the right-hand side.
But with this definition the following doesn't work, and I really don't know why:
varName AND false
A qualified name without a dot returns
line 1:8 no viable alternative at input 'varName AND'
The expected solution would be either to enable/disable WHITESPACE -> channel(HIDDEN) for specific rules only,
to tell the boolLiteral rule that it cannot start with a dot, something like ~POINT falseLiteral (I tried this as well, with no luck),
or to get qualifiedName working without a dot when the rule is moved to the lexer.
Thanks.
You could do something like this:
qualifiedName
: ID ('.' (anyId | INT))*
;
anyId
: ID
| TRUE
| FALSE
;
I want to write a grammar using ANTLR4 that will parse a definition, but I've been struggling to get ANTLR to cooperate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum munch rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).
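Concretely, a sketch of the grammar with a single IDENTIFIER token and the name distinction left to the parser rules (rule names kept from the question; untested beyond the sample input):

```antlr
grammar TypeDefGrammar;
start     : statement+ EOF ;
statement : type NEWLINE
          | property NEWLINE
          | NEWLINE ;
type      : 'type' IDENTIFIER ;         // e.g. 'type MyType1'
property  : IDENTIFIER ':' IDENTIFIER ; // e.g. 'someProperty1: int'
IDENTIFIER : (LETTER | '_') (LETTER | DIGIT | '_')* ;
fragment LETTER : [a-zA-Z] ;
fragment DIGIT  : [0-9] ;
NEWLINE   : '\r'? '\n' ;
WS        : [ \t] -> skip ;
```

Here 'type' becomes an implicit keyword token, which ANTLR matches ahead of IDENTIFIER, so the type lines still lex correctly (at the cost of forbidding type as a property name).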
In my grammar, I want to have both "variable identifiers" and "function identifiers". Essentially, I want to be less restrictive on the characters allowed in function identifiers. However, I am running in to the issue that all variable identifiers are valid function identifiers.
As an example, say I want to allow uppercase letters in a function identifier but not in a variable identifier. My current (presumably naive) might look like:
prog : 'func' FunctionId
| 'var' VariableId
;
FunctionId : [a-zA-Z]+ ;
VariableId : [a-z]+ ;
With the above rules, var hello fails to parse. If I understand correctly, this is because FunctionId is defined first, so "hello" is treated as a FunctionId.
Can I make antlr choose the more specific valid rule?
An explanation of why your grammar does not work as expected can be found here.
You can solve this with semantic predicates:
grammar Test;
prog : 'func' functionId
| 'var' variableId
;
functionId : Id;
variableId : {isVariableId(getCurrentToken().getText())}? Id ;
Id : [a-zA-Z]+;
At the lexer level there will only be Id tokens. At the parser level you can restrict an id to lowercase characters. isVariableId(String) would look like:
public boolean isVariableId(String text) {
    return text.matches("[a-z]+");
}
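As a standalone sanity check of the predicate's regex (the class name and main method here are just scaffolding for illustration, not part of the generated parser):

```java
// Mirrors the isVariableId predicate from the parser members:
// a variable id is one or more lowercase ASCII letters.
public class VariableIdCheck {
    public static boolean isVariableId(String text) {
        return text.matches("[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(isVariableId("hello")); // all lowercase -> true
        System.out.println(isVariableId("Hello")); // uppercase letter -> false
        System.out.println(isVariableId(""));      // '+' requires at least one char -> false
    }
}
```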
Can I make antlr choose the more specific valid rule?
No (as already mentioned). The lexer merely matches as much as it can, and in case 2 or more rules match the same, the one defined first "wins". There is no way around this.
I'd go for something like this:
prog : 'func' functionId
| 'var' variableId
;
functionId : LowerCaseId | UpperCaseId ;
variableId : LowerCaseId ;
LowerCaseId : [a-z]+ ;
UpperCaseId : [A-Z] [a-zA-Z]* ;
Grammar:
grammar Test;
file: (procDef | statement)* EOF;
procDef: 'procedure' ID NL statement+ ;
statement: 'statement'? NL;
WS: (' ' | '\t') -> skip;
NL: ('\r\n' | '\r' | '\n');
ID: [a-zA-Z0-9]+;
Test data:
statement
procedure Proc1
statement
statement
The parser does what I want (i.e. statement+ is greedy), but it reports an ambiguity because it doesn't know whether the last statement belongs to procDef or file (as I understand it).
As predicates are language dependent I'd prefer not to use one.
The procedure is supposed to end when a statement that can't belong to it, such as 'procedure', occurs.
I also would prefer to have the statements bound to the procedure to avoid having to rearrange the structure later.
Edit
It seems I should expand my test data a bit (but I will leave the original as it is small and shows the ambiguity I want to solve).
I want to be able to handle situations like this:
statement
procedure Proc1
statement
statement
procedure Proc2
statement
statement
procedure Proc2a
statement
statement
global
statement
procedure Proc3
statement
statement
(The indentation is not significant.) I can do it without predicates with something like
file: (
commonStatement
| globalStatement
)* EOF;
procDef: 'procedure' ID NL commonStatement+ ;
commonStatement: 'statement'? NL;
globalStatement: 'global' NL | procDef (globalStatement | EOF);
but then the tree becomes deeper with each consecutive procDef, and that feels very undesirable.
Then a solution with predicates is actually preferable.
#parser::members { boolean inProc; }
file: (
{!inProc}? commonStatement
| globalStatement
)* EOF;
procDef: 'procedure' ID {inProc = true;} NL commonStatement+ ;
commonStatement: 'statement'? NL;
globalStatement: ('global' NL {inProc = false;} | procDef) ;
The situation is actually worse than this, as globally accessible commonStatements can occur without an intervening globalStatement (they are reachable through gotos), but there is no way a parser can distinguish between that and statements belonging to the procedure, so my plan was simply to discourage such use (and I don't think it's common). In fact, it is perfectly legal to jump into procedure code as well ...
It may turn out that in the end I will have to examine runtime paths anyway (scope is very much determined at runtime), and the grammar might end up something like
file: (
commonStatement
| globalStatement
| procDef
)* EOF;
procDef: 'procedure' ID NL procStatement*;
commonStatement: 'statement'? NL;
procStatement: 'proc' NL;
globalStatement: 'global' NL;
We will see ...
By your criteria, it is impossible for a statement to follow a procDef. You are well within your rights to design a language that way, but I hope you have an answer ready for the FAQ "How do I write a statement which comes after a procedure definition."
Writing the grammar is the easy part:
file: statement* procDef* EOF;
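Filled out with the token rules from the question, the whole grammar might look like this (a sketch; it makes the structure unambiguous by forbidding top-level statements after the first procDef):

```antlr
grammar Test;
file      : statement* procDef* EOF ;
procDef   : 'procedure' ID NL statement+ ;
statement : 'statement'? NL ;
WS : (' ' | '\t') -> skip ;
NL : ('\r\n' | '\r' | '\n') ;
ID : [a-zA-Z0-9]+ ;
```

Every trailing statement then binds to the last procDef, which also removes the reported ambiguity.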
I have a language with keywords like hello that are only keywords in certain types of sentences. In other types of sentences, these words should be matched as an ID, for example. Here's a super simple grammar that tells the story:
grammar Hello;
file : ( sentence )* ;
sentence : 'hello' ID PERIOD
| INT ID PERIOD;
ID : [a-z]+ ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
PERIOD : '.' ;
I'd like these sentences to be valid:
hello fred.
31 cheeseburgers.
6 hello.
but that last sentence doesn't work in this grammar. The word hello is a token of type hello and not of type ID. It seems like the lexer grabs all the hellos and turns them into tokens of that type.
Here's a crazy way to do it, to explain what I want:
sentence : 'hello' ID PERIOD
| INT crazyID PERIOD;
crazyID : ID | 'hello' ;
but in my real language, there are a lot of keywords like hello to deal with, so, yeah, that way seems crazy.
Is there a reasonable, compact, target-language-independent way to handle this?
A standard way of handling keywords:
file : ( sentence )* EOF ;
sentence : key=( KEYWORD | INT ) id=( KEYWORD | ID ) PERIOD ;
KEYWORD : 'hello' | 'goodbye' ; // list others as alts
PERIOD : '.' ;
ID : [a-z]+ ;
INT : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
The seeming ambiguity between the KEYWORD and ID rules is resolved based on the KEYWORD rule being listed before the ID rule.
In the parser SentenceContext, TerminalNode variables key and id will be generated and, on parsing, will effectively hold the matched tokens, allowing easy positional identification.