Choosing lexer mode based on variable - antlr4

My lexer (target language C++) contains a simple rule for parsing a string literal:
STRING: '"' ~'"'+ '"';
But based on the value returned by a function, I want my lexer to return either a STRING or an IDENT.
I've tried the following:
STRING_START: '"' -> mode(current_string_mode());
or
STRING_START: '"' -> mode(current_string_mode() == IDENT ? MODE_IDENT : MODE_STRING) ;
In either case, I get an error when trying to generate the lexer (error message says:'"' came as a complete surprise)

Alas, that is not possible.
If I look at the grammar of ANTLR itself, I see this:
lexerCommands
: RARROW lexerCommand (COMMA lexerCommand)*
;
lexerCommand
: lexerCommandName LPAREN lexerCommandExpr RPAREN
| lexerCommandName
;
lexerCommandName
: identifier
| MODE
;
lexerCommandExpr
: identifier
| INT
;
In short: the part between parenthesis (mode(...) or pushMode(...)) must be an identifier, or an integer literal. It cannot be an expression (what you're trying to do).

Related

Parsing key / value subparameters

I'm a bit clueless as to how I can parse (more or less) "free form" parameter lists, suppose the syntax allows for
PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE)
I have managed to basically parse combos of positional and key/value parameters, but as soon as I hit a lexer token like PARM= the parser bails out with a "mismatched input", and I can't specifically allow for or expect anything because these parameters passed to a function are completely arbitrary.
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode, the delimiters are PARM=( on the left and the closing ) on the right, but as the "data" itself can contain (pairs of) brackets how would I identify the correct closing paren so I don't prematurely end the lexer mode?
TIA - Alex
Edit 1:
Minimal grammar showing the issue with keywords being used where they shouldn't, as this is part of a complex grammar I can't change the order of tokens to put ID in front of everything else, for example, as it would catch too much. So I don't see how this can work short of breaking out into a different lexer mode.
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN) ;
COMMA : ',' ;
EQUALS : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;
PARM : 'PARM=' ;
ID : ID_LITERAL ;
fragment ID_LITERAL : [A-Za-z]+ ;
.
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parms : PARM LPAREN parm+ RPAREN ;
parm : (pkey=ID EQUALS)? pval=ID COMMA? ;
Input:
PARM=( TEST, KEY=VAL, PARM=X)
Results in
line 1:22 extraneous input 'PARM=' expecting {')', ID}
So I'd think I'll need to switch to a specific lexer mode but right now I can't see how I would properly switch back to "normal" mode
Instead of switching to modes (with -> mode(...)), you can push your "special" mode on a stack (with -> pushMode(...)) and then when encountering a ) you pop a mode from the stack. That way, you can have multiple nested lists (..(..(..).)..). A quick demo:
lexer grammar ParmLexer;
SPACE : [ \t\r\n]+ -> channel(HIDDEN);
EQUALS : '=' ;
LPAREN : '(' -> pushMode(InList);
PARM : 'PARM';
ID : [A-Za-z] [A-Za-z0-9]*;
mode InList;
LST_LPAREN : '(' -> type(LPAREN), pushMode(InList);
RPAREN : ')' -> popMode;
COMMA : ',';
LST_EQUALS : '=' -> type(EQUALS);
STRING : '\'' ~['\r\n]* '\'';
LST_ID : [A-Za-z] [A-Za-z0-9]* -> type(ID);
LST_SPACE : [ \t\r\n]+ -> channel(HIDDEN);
and:
parser grammar ParmParser;
options { tokenVocab=ParmLexer; }
parse
: PARM EQUALS list EOF
;
list
: LPAREN ( value ( COMMA value )* )? RPAREN
;
value
: ID
| STRING
| key_value
| ID list
;
key_value
: ID EQUALS value
;
which will parse your example input PARM=(VAL1, 'VAL2', VAL3, KEY4=VAL4, KEY5=VAL5(XYZ), PARM=ABC, SOMETHING=ELSE) like this:
You don't have a rule (alternative) that recognizes a PARM token in your parm rule.
Bart has provided an answer using Lexer modes (and assuming that LPAREN and RPAREN always control those modes), but you can also just set up a parser rule that matches all of your keywords:
lexer grammar ParmLexer
;
SPACE: [ \t\r\n]+ -> channel(HIDDEN);
COMMA: ',';
EQUALS: '=';
LPAREN: '(';
RPAREN: ')';
PARM: 'PARM';
KW1: 'KW1';
KW2: 'KW2';
ID: ID_LITERAL;
fragment ID_LITERAL: [A-Za-z]+;
parser grammar ParmParser
;
options {
tokenVocab = ParmLexer;
}
parms: PARM EQUALS LPAREN parm (COMMA parm)* RPAREN;
parm: ((pkey = ID | kwid = kw) EQUALS)? pval = ID;
kw: PARM | KW1 | KW2;
input
"PARM=( TEST, KEY=VAL, KW2=v2, PARM=X)"
yields:
(parms PARM = ( (parm TEST) , (parm KEY = VAL) , (parm (kw KW2) = v) , (parm (kw PARM) = X) ))

Antlr4 grammar wouldn't parse multiline input

I want to write a grammar using Antlr4 that will parse a some definition but I've been struggling to get Antlr to co-operate.
The definition has two kinds of lines, a type and a property. I can get my grammar to parse the type line correctly but it either ignores the property lines or fails to identify PROPERTY_TYPE depending on how I tweak my grammar.
Here is my grammar (attempt # 583):
grammar TypeDefGrammar;
start
: statement+ ;
statement
: type NEWLINE
| property NEWLINE
| NEWLINE ;
type
: TYPE_KEYWORD TYPE_NAME; // e.g. 'type MyType1'
property
: PROPERTY_NAME ':' PROPERTY_TYPE ; // e.g. 'someProperty1: int'
TYPE_KEYWORD
: 'type' ;
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
fragment LETTER
: [a-zA-Z] ;
fragment DIGIT
: [0-9] ;
NEWLINE
: '\r'? '\n' ;
WS
: [ \t] -> skip ;
Here is a sample input:
type SimpleType
intProp1: int
stringProp2 : String
(returns the type but ignores intProp1, stringProp2.)
What am I doing wrong?
Usually when a rule does not match the whole input, but does match a prefix of it, it will simply match that prefix and leave the rest of the input in the stream without producing an error. If you want your rule to always match the whole input, you can add EOF to the end of the rule. That way you'll get proper error messages when it can't match the entire input.
So let's change your start rule to start : statement+ EOF;. Now applying start to your input will lead to the following error messages:
line 3:0 extraneous input 'intProp1' expecting {, 'type', PROPERTY_NAME, NEWLINE}
line 4:0 extraneous input 'stringProp2' expecting {, 'type', PROPERTY_NAME, NEWLINE}
So apparently intProp1 and stringProp2 aren't recognized as PROPERTY_NAMEs. So let's look at which tokens are generated (you can do that using the -tokens option to grun or by just iterating over the token stream in your code):
[#0,0:3='type',<'type'>,1:0]
[#1,5:14='SimpleType',<TYPE_NAME>,1:5]
[#2,15:15='\n',<NEWLINE>,1:15]
[#3,16:16='\n',<NEWLINE>,2:0]
[#4,17:24='intProp1',<TYPE_NAME>,3:0]
[#5,25:25=':',<':'>,3:8]
[#6,27:29='int',<TYPE_NAME>,3:10]
[#7,30:30='\n',<NEWLINE>,3:13]
[#8,31:41='stringProp2',<TYPE_NAME>,4:0]
[#9,43:43=':',<':'>,4:12]
[#10,45:50='String',<TYPE_NAME>,4:14]
[#11,51:51='\n',<NEWLINE>,4:20]
[#12,52:51='<EOF>',<EOF>,5:0]
So all of the identifiers in the code are recognized as TYPE_NAMEs, not PROPERTY_NAMEs. In fact, it is not clear what should distinguish a TYPE_NAME from a PROPERTY_NAME, so now let's actually look at your grammar:
TYPE_NAME
: IDENTIFIER ;
PROPERTY_NAME
: IDENTIFIER ;
PROPERTY_TYPE
: IDENTIFIER ;
fragment IDENTIFIER
: (LETTER | '_') (LETTER | DIGIT | '_' )* ;
Here you have three lexer rules with exactly the same definition. That's a bad sign.
Whenever multiple lexer rules can match on the current input, ANTLR chooses the one that would produce the longest match, picking the one that comes first in the grammar in case of ties. This is known as the maximum munch rule.
If you have multiple rules with the same definition, that means those rules will always match on the same input and they will always produce matches of the same length. So by the maximum much rule, the first definition (TYPE_NAME) will always be used and the other ones might as well not exist.
The problem basically boils down to the fact that there's nothing that lexically distinguishes the different types of names, so there's no basis on which the lexer could decide which type of name a given identifier represents. That tells us that the names should not be lexer rules. Instead IDENTIFIER should be a lexer rule and the FOO_NAMEs should either be (somewhat unnecessary) parser rules or removed altogether (you can just use IDENTIFIER wherever you're currently using FOO_NAME).

antlr4 all words except the operators

grammar TestGrammar;
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WORD : [a-z0-9._#+=]+(' '[a-z0-9._#+=]+)* ;
WS : [ \t\r\n]+ -> skip ;
quotedword : DQUOTE WORD DQUOTE;
expression
: LPAREN expression+ RPAREN
| expression (AND expression)+
| expression (OR​ expression)+
| expression (NOT​ expression)+
| NOT expression+
| quotedword
| WORD;
I've managed to implement the above grammar for antlr4.
I've got a long way to go but for now my question is,
how can I make WORD generic? Basically I want this [a-z0-9._#+=] to be anything except the operators (AND, OR, NOT, LPAREN, RPAREN, DQUOTE, SPACE).
The lexer will use the first rule that can match the given input. Only if that rule can't match it, it will try the next one.
Therefore you can make your WORD rule generic by using this grammar:
AND : 'AND' ;
OR : 'OR'|',' ;
NOT : 'NOT' ;
LPAREN : '(' ;
RPAREN : ')' ;
DQUOTE : '"' ;
WS : [ \t\r\n]+ -> skip ;
WORD: .+? ;
Make sure to use the non-greedy operator ? in this case becaue otherwise once invoked the WORD rule will consume all following input.
As WORD is specified last, input will only be tried to be consumed by it if all previous lexer rules (all that have been defined above in the source code) have failed.
EDIT: If you don't want your WORD rule to match any input then you just have to modify the rule I provided. But the essence of my answer is that in the lexer you don't have to worry about two rules potentially matching the same input as long as you got the order in the source code right.
Try something like this grammar:
grammar TestGrammar;
...
WORD : Letter+;
QUOTEDWORD : '"' (~["\\\r\n])* '"' // disallow quotes, backslashes and crlf in literals
WS : [ \t\r\n]+ -> skip ;
fragment Letter :
[a-zA-Z$_] // these are the "java letters" below 0x7F
| ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
| [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
;
expression:
...
| QUOTEDWORD
| WORD+;
Maybe you want to use escape sequences in QUOTEDWORD, then look in this example how to do this.
This grammar allows you:
to have quoted words interpreted as string literals (preserving all spaces within)
to have multiple words separated by whitespace (which is ignored)

Is there a language-agnostic way to do simple predicates in the parser?

Goal
I want to reduce (or eliminate) the Java-specific actions and predicates in my parser. Perhaps it isn't possible, but I wanted to ask here just in case there's some ANTLR4 feature I've missed. (The language itself is third-party, so I don't have control over that.)
Simplified example
The predicates I want to use are mostly exact (or perhaps case-insensitive) string-matching. I could make big parallel sets of parser rules, but I'd rather not since the real-life example is considerably more convoluted.
Suppose I'm given something like:
isWidget(int) : "Whether it is a widget" : 4 ;
ownerFirstName(string) : "john" ;
ownerLastName(string) : "This is the last-name of the owner" : "doe" ;
I want the parser to look at the default-value (the last item on the line, like 4, "john" or "doe") and parse it based on the earlier type (int), (string), (string).
main
: stmt SEMIC (stmt SEMIC)* EOF
;
stmt
: propname=IDENTIFIER LPAREN datatype=IDENTIFIER RPAREN (COLON description=QUOTSTRING)? COLON df=defaultVal
;
defaultVal
: QUOTSTRING //TODO only this alt if datatype=string
| NUM //TODO only this alt if datatype=int
;
fragment Letter : 'a'..'z' | 'A'..'Z' ;
fragment Digit : '0'..'9' ;
fragment Underscore : '_' ;
SEMIC : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
IDENTIFIER : (Letter|Underscore) (Letter|Underscore|Digit)* ;
QUOTSTRING : '"' ~('"' |'\n' | '\r' | '\u2029' | '\u2028')* '"' ;
NUM : Digit+ ;
WS : [ \t\n\r]+ -> skip ;
I know I can do it with predicates and rule inputs, but then I'm crossing the line from a language-agnostic grammar to one with embedded Java code.
Your parser should handle things like the following without a problem:
isWidget(int) : "Whether it is a widget" : "foo" ;
In other words, do not add a predicate that would fail in this case, or you will lose the ability to report sane error messages. Instead, use a language-specific listener or visitor implementation after the parse is complete to report a semantic error if the type of the default value does not match the declared type.

string recursion antlr lexer token

How do I build a token in lexer that can handle recursion inside as this string:
${*anything*${*anything*}*anything*}
?
Yes, you can use recursion inside lexer rules.
Take the following example:
${a ${b} ${c ${ddd} c} a}
which will be parsed correctly by the following grammar:
parse
: DollarVar
;
DollarVar
: '${' (DollarVar | EscapeSequence | ~Special)+ '}'
;
fragment
Special
: '\\' | '$' | '{' | '}'
;
fragment
EscapeSequence
: '\\' Special
;
as the interpreter inside ANTLRWorks shows:
alt text http://img185.imageshack.us/img185/5471/recq.png
ANTLR's lexers do support recursion, as #BartK adeptly points out in his post, but you will only see a single token within the parser. If you need to interpret the various pieces within that token, you'll probably want to handle it within the parser.
IMO, you'd be better off doing something in the parser:
variable: DOLLAR LBRACE id variable id RBRACE;
By doing something like the above, you'll see all the necessary pieces and can build an AST or otherwise handle accordingly.

Resources