ANTLR 4.1: variable token multiplicity yields error: "closure with at least one alternative that can match empty string"

I'm trying to create a grammar for Internationalized Resource Identifiers in ANTLR 4.1. The part I've had the hardest time with so far is getting the production rule for ipv6address to work correctly. RFC 3987 defines ipv6address with 9 different alternatives in ABNF for that production rule alone:
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
Here, ls32 and h16 are both subrules defined as:
ls32 = ( h16 ":" h16 ) / IPv4address
And as such for h16:
h16 = 1*4HEXDIG
Where HEXDIG is a lexer rule for valid hexadecimal digits. I've tried to write this ABNF grammar in ANTLR syntax as follows:
grammar IRI;
iri : scheme ':' ihier_part ('?' iquery)? ('#' ifragment)? ;
ihier_part : ('//' iauthority ipath_abempty
| ipath_absolute
| ipath_rootless)?
;
iri_reference : iri
| irelative_ref
;
absolute_IRI : scheme ':' ihier_part ('?' iquery)? ;
irelative_ref : irelative_part ('?' iquery)? ('#' ifragment)? ;
irelative_part : ('//' iauthority ipath_abempty
| ipath_absolute
| ipath_noscheme)?
;
iauthority : (iuserinfo '@')? ihost (':' port)? ;
iuserinfo : (iunreserved | pct_encoded | sub_delims | ':')* ;
ihost : ip_literal
| ipv4address
| ireg_name
;
ireg_name : (iunreserved | pct_encoded | sub_delims)* ;
ipath : (ipath_abempty
| ipath_absolute
| ipath_noscheme
| ipath_rootless)?
;
ipath_abempty : ('/' isegment)* ;
ipath_absolute : '/' (isegment_nz ('/' isegment)*)? ;
ipath_noscheme : isegment_nz_nc ('/' isegment)* ;
ipath_rootless : isegment_nz ('/' isegment)* ;
isegment : (ipchar)* ;
isegment_nz : (ipchar)+ ;
isegment_nz_nc : (iunreserved | pct_encoded | sub_delims | '@')+ ;
ipchar : iunreserved
| pct_encoded
| sub_delims
| ':'
| '@'
;
iquery : (ipchar | IPRIVATE | '/' | '?')* ;
ifragment : (ipchar | '/' | '?')* ;
iunreserved : ALPHA
| DIGIT
| '-'
| '.'
| '_'
| '~'
| UCSCHAR
;
fragment
UCSCHAR : '\u00A0'..'\uD7FF' | '\uF900'..'\uFDCF' | '\uFDF0'..'\uFFEF'
| '\u40000'..'\u4FFFD' | '\u50000'..'\u5FFFD' | '\u60000'..'\u6FFFD'
| '\u70000'..'\u7FFFD' | '\u80000'..'\u8FFFD' | '\u90000'..'\u9FFFD'
| '\uA0000'..'\uAFFFD' | '\uB0000'..'\uBFFFD' | '\uC0000'..'\uCFFFD'
| '\uD0000'..'\uDFFFD' | '\uE1000'..'\uEFFFD'
;
fragment
IPRIVATE : '\uE000'..'\uF8FF' | '\uF0000'..'\uFFFFD' | '\u100000'..'\u10FFFD' ;
scheme : ALPHA (ALPHA | DIGIT | '+' | '-' | '.')* ;
port : (DIGIT)* ;
ip_literal : '[' (ipv6address | ipvFuture) ']' ;
ipvFuture : 'v' (HEXDIG)+ '.' (unreserved | sub_delims | ':')+ ;
ipv6address
locals [int i1, i2, i3, i4, i5, i6, i7, i8, i9, i10 = 0;]
: ( {$i1<=6}? h16 ':' {$i1++;} )* ls32
| '::' ( {$i2<=5}? h16 ':' {$i2++;} )* ls32
| (h16)? '::' ( {$i3<=4}? h16 ':' {$i3++;} )* ls32
| ((h16 ':')? h16)? '::' ( {$i4<=3}? h16 ':'{$i4++;} )* ls32
| (( {$i5>=0 && $i5<=2}? h16 ':' {$i5++;} )* h16)? '::' ( {$i6<=2}? h16 ':' {$i6++;} )* ls32
| (( {$i7>=0 && $i7<=3}? h16 ':' {$i7++;} )* h16)? '::' h16 ':' ls32
| (( {$i8>=0 && $i8<=4}? h16 ':' {$i8++;} )* h16)? '::' ls32
| (( {$i9>=0 && $i9<=5}? h16 ':' {$i9++;} )* h16)? '::' h16
| (( {$i10>=0 && $i10<=6}? h16 ':' {$i10++;} )* h16)* '::'
;
h16
locals [int i = 1;]
: ( {$i>=1 && $i<=4}? HEXDIG {$i++;} )* ;
ls32 : h16 ':' h16 ;
ipv4address : DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET '.' DEC_OCTET ;
DEC_OCTET : '0'..'9'
| '10'..'99'
| '100'..'199'
| '200'..'249'
| '250'..'255'
;
pct_encoded : '%' HEXDIG HEXDIG ;
unreserved : ALPHA | DIGIT | '-' | '.' | '_' | '~' ;
reserved : gen_delims
| sub_delims
;
gen_delims : ':' | '/' | '?' | '#' | '[' | ']' | '@' ;
sub_delims : '!' | '$' | '&' | '\'' | '(' | ')' ;
DIGIT : [0-9] ;
HEXDIG : [0-9A-F] ;
ALPHA : [a-zA-Z] ;
WS : [' ' | '\t' | '\r' | '\n']+ -> skip ;
In my ANTLR grammar, I'm trying to use semantic predicates to specify the multiplicity rules defined in the ABNF grammar, both for ipv6address and h16. When I execute the org.antlr.v4.Tool class, I get the following output:
warning(125): IRI.g4:68:20: implicit definition of token 'IPRIVATE' in parser
warning(125): IRI.g4:78:4: implicit definition of token 'UCSCHAR' in parser
error(153): IRI.g4:100:0: rule 'ipv6address' contains a closure with at least one alternative that can match an empty string
warning(154): IRI.g4:40:0: rule 'ipath' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
warning(154): IRI.g4:100:0: rule 'ipv6address' contains an optional block with at least one alternative that can match an empty string
Obviously I'd like to get rid of the warnings as well, but I need to get rid of the error stating 'ipv6address' contains a closure with at least one alternative that can match an empty string. I've seen similar posts on StackOverflow about multiple alternatives errors. However, none of them dealt with closures that could match the empty string. I also am pretty sure I'm going to have to define the Unicode characters in UCSCHAR past \uFFFF as surrogate pairs, but that I'll take care of later. Just need to know how to get rid of the closure problem for now.

There are quite some things going wrong:
0
What 280Z28 said.
1
'250'..'255' does not match the strings "250" ... "255": you need to match the numeric ranges as described in the original ABNF specs:
ABNF
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
ANTLR
dec_octet
: digit
| non_zero_digit digit
| D1 digit digit
| ...
;
2
You have a lot of conflicting lexer rules. Take these for example:
HEXDIG : [0-9A-F] ;
ALPHA : [a-zA-Z] ;
Because HEXDIG is defined before ALPHA, the lexer will always create a HEXDIG when it sees 'A', for example. You must realize that the lexer does not produce tokens based on what the parser would like to receive. The lexer goes its own way and will never produce an ALPHA for the uppercase letters A-F.
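The rule-selection behaviour described here can be imitated outside ANTLR. The following is a minimal Python sketch, not ANTLR's implementation: it assumes only the two rules of thumb "longest match wins" and "on a tie, the rule defined first wins", and the `RULES` table and `tokenize` function are illustrative names, not part of any ANTLR API.

```python
import re

# Lexer rules in definition order, mirroring the question's grammar.
RULES = [
    ("DIGIT", r"[0-9]"),
    ("HEXDIG", r"[0-9A-F]"),
    ("ALPHA", r"[a-zA-Z]"),
]

def tokenize(text, rules=RULES):
    """Pick the longest match at each position; ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when the lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

print(tokenize("A9g"))
# [('HEXDIG', 'A'), ('DIGIT', '9'), ('ALPHA', 'g')]
```

With this ordering, 'A' is always reported as HEXDIG and '9' as DIGIT, regardless of what a parser rule at that point would expect.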
3
fragment rules can only be used inside other lexer rules (or other fragment rules). You cannot use them inside parser rules.
4
Not really an issue, but the predicates make your grammar hard to read. My rule of thumb: keep predicates to a minimum where possible.
Your rule:
h16
locals [int i = 1;]
: ( {$i>=1 && $i<=4}? HEXDIG {$i++;} )* ;
could be written as:
h16
: HEXDIG HEXDIG HEXDIG HEXDIG
| HEXDIG HEXDIG HEXDIG
| HEXDIG HEXDIG
| HEXDIG
;
or even:
h16
: HEXDIG (HEXDIG (HEXDIG HEXDIG?)?)?
;
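As a quick sanity check that the nested form really means "one to four hex digits", both shapes can be transcribed as regular expressions and compared. This is only a Python sketch over a small test alphabet (an assumption made to keep the enumeration short), not a proof and not ANTLR code.

```python
import re
from itertools import product

# HEXDIG (HEXDIG (HEXDIG HEXDIG?)?)? transcribed as a regular expression.
NESTED = re.compile(r"[0-9A-F](?:[0-9A-F](?:[0-9A-F][0-9A-F]?)?)?")
# The ABNF definition h16 = 1*4HEXDIG.
COUNTED = re.compile(r"[0-9A-F]{1,4}")

def agree(max_len=5, alphabet="0AFG"):
    """True if both patterns accept exactly the same strings up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if bool(NESTED.fullmatch(s)) != bool(COUNTED.fullmatch(s)):
                return False
    return True

print(agree())  # True
```

Note that neither pattern accepts the empty string, which is the property the predicate-based version of h16 was missing.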
Most of these issues are easily fixed, but #2 is trickier. What you could (should?) do is let the lexer create single-character tokens and let the parser combine them into larger constructs. Here is an example of how the parser could match the dec-octet production from the official ABNF:
dec_octet
: digit // 0-9
| non_zero_digit digit // 10-99
| D1 digit digit // 100-199
| D2 (D0 | D1 | D2 | D3 | D4) digit // 200-249
| D2 D5 (D0 | D1 | D2 | D3 | D4 | D5) // 250-255
;
digit
: D0
| non_zero_digit
;
non_zero_digit
: D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9
;
// lexer rules
D0 : '0';
D1 : '1';
D2 : '2';
D3 : '3';
D4 : '4';
D5 : '5';
D6 : '6';
D7 : '7';
D8 : '8';
D9 : '9';
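The five dec_octet alternatives above can be checked mechanically. The sketch below transcribes them into a Python regular expression (the transcription is mine, not part of the answer) and confirms that, among the decimal strings for 0 to 999, exactly 0 to 255 are accepted.

```python
import re

# The five dec-octet alternatives from the ABNF, in the same order.
DEC_OCTET = re.compile(
    r"[0-9]"            # 0-9
    r"|[1-9][0-9]"      # 10-99
    r"|1[0-9][0-9]"     # 100-199
    r"|2[0-4][0-9]"     # 200-249
    r"|25[0-5]"         # 250-255
)

def is_dec_octet(s):
    """True if the whole string is a valid dec-octet."""
    return DEC_OCTET.fullmatch(s) is not None

accepted = [n for n in range(1000) if is_dec_octet(str(n))]
print(accepted == list(range(256)))  # True
print(is_dec_octet("01"))            # False: no leading zeros
```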
I once wrote an IRI grammar for ANTLR 3. If you want, I could put it on GitHub somewhere.

Your h16 rule uses (...)* instead of (...)+, which allows it to match 0 digits. When you place h16* in your grammar, it means you allow any number of nothings in your parse tree, which would always result in an infinite loop running your system out of memory (creating parse tree nodes with no tokens).
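Why an empty-matching element inside a closure is a problem can be shown with a toy matcher. This Python sketch does not model ANTLR's parsing algorithm; it only illustrates the lack of progress. The `max_iterations` cap is an artifact of the sketch so that it terminates, and all function names are hypothetical.

```python
HEX = "0123456789ABCDEF"

def match_h16_star(text, pos):
    """The question's h16: zero to four hex digits, so it can match nothing."""
    end = pos
    while end < len(text) and end - pos < 4 and text[end] in HEX:
        end += 1
    return end  # may equal pos: an empty match that still "succeeds"

def match_h16_plus(text, pos):
    """h16 with (...)+ semantics: fails (None) unless a digit was consumed."""
    end = match_h16_star(text, pos)
    return end if end > pos else None

def closure(matcher, text, pos, max_iterations=1000):
    """Apply matcher repeatedly, as ( ... )* would, and count the iterations."""
    iterations = 0
    while iterations < max_iterations:
        new_pos = matcher(text, pos)
        if new_pos is None:
            break
        iterations += 1
        pos = new_pos
    return pos, iterations

print(closure(match_h16_star, "::", 0))      # (0, 1000): never advances
print(closure(match_h16_plus, "::", 0))      # (0, 0): stops immediately
print(closure(match_h16_plus, "12AB::", 0))  # (4, 1)
```

The first call only stops because of the cap; without it the loop would keep "matching" nothing at position 0.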

Related

How to parse statement in order of desired precedence using antlr?

I have an RSQL grammar defined:
grammar Rsql;
statement
: L_PAREN wrapped=statement R_PAREN
| left=statement op=( AND_OPERATOR | OR_OPERATOR ) right=statement
| node=comparison
;
comparison
: single_comparison
| multi_comparison
| bool_comparison
;
single_comparison
: key=IDENTIFIER op=( EQ | NE | GT | GTE | LT | LTE ) value=single_value
;
multi_comparison
: key=IDENTIFIER op=( IN | NIN ) value=multi_value
;
bool_comparison
: key=IDENTIFIER op=EX value=boolean_value
;
boolean_value
: BOOLEAN
;
single_value
: boolean_value
| ( STRING_LITERAL | IDENTIFIER )
| NUMERIC_LITERAL
;
multi_value
: L_PAREN single_value ( COMMA single_value )* R_PAREN
| single_value
;
TRUE: 'true';
FALSE: 'false';
AND_OPERATOR: ';';
OR_OPERATOR: ',';
L_PAREN: '(';
R_PAREN: ')';
COMMA: ',';
EQ: '==';
NE: '!=';
IN: '=in=';
NIN: '=out=';
GT: '=gt=';
LT: '=lt=';
GTE: '=ge=';
LTE: '=le=';
EX: '=ex=';
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]*
;
BOOLEAN
: TRUE
| FALSE
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( [-+]? DIGIT+ )?
| '.' DIGIT+ ( [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n"] )* '"'
;
STRING_ESCAPE_SEQ
: '\\' .
;
fragment DIGIT : [0-9];
No matter how I attempt to parse this (listener/visitor), the statements with parentheses always get evaluated in order. It is my understanding that the order of the alternatives in the rule determines the precedence. However, the parse tree for a statement like "name==foo,(name==bar;age=gt=35)" is always
the same, no matter where the parentheses appear. Please help me discover what I'm missing. Thanks!

antlr4 grammar with negative option

In ANTLR 4 I want to define a string that excludes the combination := while still permitting the individual characters : and =. What is the syntax to define this in the grammar?
EQUAL : '=';
NUMBER: DIGIT+;
DIGIT : ('0'..'9');
LITERALEQUAL: ((CHAR | NUMBER | EQUAL | OTHERS) ' '?)+;
fragment CHAR :[a-z]| [A-Z];
fragment OTHERS: '.' | '/' | ':' | '-' | '#' | '?' | '&' | '_' | '[' | ']' | '^' | ';' | '"' | '=';
As long as you don't make a lexer rule or implicit token like:
stmt : value ':=' something ; <-- implicit token
or
BADEquals : ':=' ; <-- explicit lexer definition
your eventual grammar won't allow it, if your goal is to allow : and = but exclude the combination := .
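Independently of ANTLR, the constraint in the question ("allow ':' and '=' but never ':='") can be stated as a pattern in which ':' is accepted only when the next character is not '='. The following Python regular expression is a sketch of that idea only; translating it into an ANTLR lexer rule would need a predicate or a restructured rule and is not shown here.

```python
import re

# Any character other than ':', or a ':' that is not followed by '='.
NO_ASSIGN = re.compile(r"(?:[^:]|:(?!=))+")

def is_allowed(s):
    """True if s is non-empty and never contains the sequence ':='."""
    return NO_ASSIGN.fullmatch(s) is not None

print(is_allowed("a:b=c"), is_allowed("a=:b"), is_allowed("a:=b"))
# True True False
```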

how can I refactor this ANTLR4 grammar so that it isn't mutually left recursive?

I can't seem to figure out why this grammar won't compile. It compiled fine until I modified line 145 from
(Identifier '.')* functionCall
to
(primary '.')? functionCall
I've been trying to figure out how to solve this issue for a while but I can't seem to be able to. Here's the error:
The following sets of rules are mutually left-recursive [primary]
grammar Tadpole;
@header
{package net.tadpole.compiler.parser;}
file
: fileContents*
;
fileContents
: structDec
| functionDec
| statement
| importDec
;
importDec
: 'import' Identifier ';'
;
literal
: IntegerLiteral
| FloatingPointLiteral
| BooleanLiteral
| CharacterLiteral
| StringLiteral
| NoneLiteral
| arrayLiteral
;
arrayLiteral
: '[' expressionList? ']'
;
expressionList
: expression (',' expression)*
;
expression
: primary
| unaryExpression
| <assoc=right> expression binaryOpPrec0 expression
| <assoc=left> expression binaryOpPrec1 expression
| <assoc=left> expression binaryOpPrec2 expression
| <assoc=left> expression binaryOpPrec3 expression
| <assoc=left> expression binaryOpPrec4 expression
| <assoc=left> expression binaryOpPrec5 expression
| <assoc=left> expression binaryOpPrec6 expression
| <assoc=left> expression binaryOpPrec7 expression
| <assoc=left> expression binaryOpPrec8 expression
| <assoc=left> expression binaryOpPrec9 expression
| <assoc=left> expression binaryOpPrec10 expression
| <assoc=right> expression binaryOpPrec11 expression
;
unaryExpression
: unaryOp expression
| prefixPostfixOp primary
| primary prefixPostfixOp
;
unaryOp
: '+'
| '-'
| '!'
| '~'
;
prefixPostfixOp
: '++'
| '--'
;
binaryOpPrec0
: '**'
;
binaryOpPrec1
: '*'
| '/'
| '%'
;
binaryOpPrec2
: '+'
| '-'
;
binaryOpPrec3
: '>>'
| '>>>'
| '<<'
;
binaryOpPrec4
: '<'
| '>'
| '<='
| '>='
| 'is'
;
binaryOpPrec5
: '=='
| '!='
;
binaryOpPrec6
: '&'
;
binaryOpPrec7
: '^'
;
binaryOpPrec8
: '|'
;
binaryOpPrec9
: '&&'
;
binaryOpPrec10
: '||'
;
binaryOpPrec11
: '='
| '**='
| '*='
| '/='
| '%='
| '+='
| '-='
| '&='
| '|='
| '^='
| '>>='
| '>>>='
| '<<='
| '<-'
;
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| (primary '.')? functionCall
;
functionCall
: functionName '(' expressionList? ')'
;
functionName
: Identifier
;
dimension
: '[' expression ']'
;
statement
: '{' statement* '}'
| expression ';'
| 'recall' ';'
| 'return' expression? ';'
| variableDec
| 'if' '(' expression ')' statement ('else' statement)?
| 'while' '(' expression ')' statement
| 'do' expression 'while' '(' expression ')' ';'
| 'do' '{' statement* '}' 'while' '(' expression ')' ';'
;
structDec
: 'struct' structName ('(' parameterList ')')? '{' variableDec* functionDec* '}'
;
structName
: Identifier
;
fieldName
: Identifier
;
variableDec
: type fieldName ('=' expression)? ';'
;
type
: primitiveType ('[' ']')*
| objType ('[' ']')*
;
primitiveType
: 'byte'
| 'short'
| 'int'
| 'long'
| 'char'
| 'boolean'
| 'float'
| 'double'
;
objType
: (Identifier '.')? structName
;
functionDec
: 'def' functionName '(' parameterList? ')' ':' type '->' functionBody
;
functionBody
: statement
;
parameterList
: parameter (',' parameter)*
;
parameter
: type fieldName
;
IntegerLiteral
: DecimalIntegerLiteral
| HexIntegerLiteral
| OctalIntegerLiteral
| BinaryIntegerLiteral
;
fragment
DecimalIntegerLiteral
: DecimalNumeral IntegerSuffix?
;
fragment
HexIntegerLiteral
: HexNumeral IntegerSuffix?
;
fragment
OctalIntegerLiteral
: OctalNumeral IntegerSuffix?
;
fragment
BinaryIntegerLiteral
: BinaryNumeral IntegerSuffix?
;
fragment
IntegerSuffix
: [lL]
;
fragment
DecimalNumeral
: Digit (Digits? | Underscores Digits)
;
fragment
Digits
: Digit (DigitsAndUnderscores? Digit)?
;
fragment
Digit
: [0-9]
;
fragment
DigitsAndUnderscores
: DigitOrUnderscore+
;
fragment
DigitOrUnderscore
: Digit
| '_'
;
fragment
Underscores
: '_'+
;
fragment
HexNumeral
: '0' [xX] HexDigits
;
fragment
HexDigits
: HexDigit (HexDigitsAndUnderscores? HexDigit)?
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
HexDigitsAndUnderscores
: HexDigitOrUnderscore+
;
fragment
HexDigitOrUnderscore
: HexDigit
| '_'
;
fragment
OctalNumeral
: '0' [oO] Underscores? OctalDigits
;
fragment
OctalDigits
: OctalDigit (OctalDigitsAndUnderscores? OctalDigit)?
;
fragment
OctalDigit
: [0-7]
;
fragment
OctalDigitsAndUnderscores
: OctalDigitOrUnderscore+
;
fragment
OctalDigitOrUnderscore
: OctalDigit
| '_'
;
fragment
BinaryNumeral
: '0' [bB] BinaryDigits
;
fragment
BinaryDigits
: BinaryDigit (BinaryDigitsAndUnderscores? BinaryDigit)?
;
fragment
BinaryDigit
: [01]
;
fragment
BinaryDigitsAndUnderscores
: BinaryDigitOrUnderscore+
;
fragment
BinaryDigitOrUnderscore
: BinaryDigit
| '_'
;
// §3.10.2 Floating-Point Literals
FloatingPointLiteral
: DecimalFloatingPointLiteral FloatingPointSuffix?
| HexadecimalFloatingPointLiteral FloatingPointSuffix?
;
fragment
FloatingPointSuffix
: [fFdD]
;
fragment
DecimalFloatingPointLiteral
: Digits '.' Digits? ExponentPart?
| '.' Digits ExponentPart?
| Digits ExponentPart
| Digits
;
fragment
ExponentPart
: ExponentIndicator SignedInteger
;
fragment
ExponentIndicator
: [eE]
;
fragment
SignedInteger
: Sign? Digits
;
fragment
Sign
: [+-]
;
fragment
HexadecimalFloatingPointLiteral
: HexSignificand BinaryExponent
;
fragment
HexSignificand
: HexNumeral '.'?
| '0' [xX] HexDigits? '.' HexDigits
;
fragment
BinaryExponent
: BinaryExponentIndicator SignedInteger
;
fragment
BinaryExponentIndicator
: [pP]
;
BooleanLiteral
: 'true'
| 'false'
;
CharacterLiteral
: '\'' SingleCharacter '\''
| '\'' EscapeSequence '\''
;
fragment
SingleCharacter
: ~['\\]
;
StringLiteral
: '"' StringCharacters? '"'
;
fragment
StringCharacters
: StringCharacter+
;
fragment
StringCharacter
: ~["\\]
| EscapeSequence
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| OctalEscape
| UnicodeEscape
;
fragment
OctalEscape
: '\\' OctalDigit
| '\\' OctalDigit OctalDigit
| '\\' ZeroToThree OctalDigit OctalDigit
;
fragment
ZeroToThree
: [0-3]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
NoneLiteral
: 'nil'
;
Identifier
: IdentifierStartChar IdentifierChar*
;
fragment
IdentifierStartChar
: [a-zA-Z$_] // these are the "java letters" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierStart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierStart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
fragment
IdentifierChar
: [a-zA-Z0-9$_] // these are the "java letters or digits" below 0xFF
| // covers all characters above 0xFF which are not a surrogate
~[\u0000-\u00FF\uD800-\uDBFF]
{Character.isJavaIdentifierPart(_input.LA(-1))}?
| // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
[\uD800-\uDBFF] [\uDC00-\uDFFF]
{Character.isJavaIdentifierPart(Character.toCodePoint((char)_input.LA(-2), (char)_input.LA(-1)))}?
;
WS : [ \t\r\n\u000C]+ -> skip
;
LINE_COMMENT
: '#' ~[\r\n]* -> skip
;
The left-recursive invocation needs to be the first element of its alternative, so it cannot be placed inside parentheses such as an optional group.
You can rewrite it like this:
primary
: literal
| fieldName
| '(' expression ')'
| '(' type ')' (primary | unaryExpression)
| 'new' objType '(' expressionList? ')'
| primary '.' fieldName
| primary dimension
| primary '.' functionCall
| functionCall
;
which is equivalent.
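The rewrite is the usual expansion of an optional prefix into two alternatives. As a small illustration in Python (the helper names are hypothetical and this is not how ANTLR analyses grammars), expanding `(primary '.')? functionCall` yields one alternative that starts directly with `primary` and one that does not recurse at all:

```python
def expand_optional(prefix, rest):
    """Rewrite '(prefix)? rest' as the alternatives 'prefix rest' and 'rest'."""
    return [prefix + rest, rest]

def is_directly_left_recursive(rule_name, alternative):
    """True when the alternative starts with a plain reference to the rule."""
    return bool(alternative) and alternative[0] == rule_name

alternatives = expand_optional(["primary", "'.'"], ["functionCall"])
for alt in alternatives:
    print(" ".join(alt), "->", is_directly_left_recursive("primary", alt))
# primary '.' functionCall -> True
# functionCall -> False
```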

Antlr - not able to use Lexer token if it is assigned to another token in grammar

In this example, if I use 'MID('int = VALUE')' it works fine. I want MID to be validated for an INT value, but when I use INT I get the error "mismatched input '9' expecting INT".
I am using the antlr-4.2-complete version of ANTLR.
I do not understand what exactly the issue is.
grammar DIExpression;
r: 'MID('int_val = INT')'
{
System.out.println("value equals: "+ $int_val.text);
};
VALUE : INT | STRING;
STRING : [0-9a-zA-Z_]+;
INT : [0-9]+;
WS : [ \t\r\n]+ -> skip ;
UPDATE:
I am giving input like MID(9)
The issue is that your rules are ambiguous. What should '9' be? It could be a STRING or an INT. I would highly recommend using the predefined literals for STRING, WS, COMMENT and NEWLINE provided by the ANTLR community.
Be aware that this is ANTLR 3 code! As you can see, a STRING is something in quotes (I guess that's also what you want):
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
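To see why the input 9 never arrives at the parser as an INT with the question's rules, the lexer's choice can be imitated in a few lines of Python. This sketch assumes only "longest match wins, ties go to the rule defined first"; it is not ANTLR's implementation. The `reordered` rule set is a hypothetical variant for comparison, not the fix proposed above.

```python
import re

def first_token(text, rules):
    """Longest match at position 0; ties go to the rule defined first."""
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# The question's rules, in their original order.
original = [
    ("VALUE", r"[0-9a-zA-Z_]+"),   # INT | STRING (every INT is also a STRING)
    ("STRING", r"[0-9a-zA-Z_]+"),
    ("INT", r"[0-9]+"),
]
# Hypothetical variant: VALUE removed, INT defined before STRING.
reordered = [
    ("INT", r"[0-9]+"),
    ("STRING", r"[0-9a-zA-Z_]+"),
]

print(first_token("9", original))    # ('VALUE', '9')
print(first_token("9", reordered))   # ('INT', '9')
print(first_token("9a", reordered))  # ('STRING', '9a')
```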

Antlr4 Grammar/Rules - issue with solving BASIC print variable

The scenario is that I want to create a BASIC (high-level) language using ANTLR4.
The test input below creates a variable called C$ and assigns it an integer value. The value assignment works. The print statement works except when concatenating the variable to a string:
************ TEST CASE ****************
$C=15;
print "dangerdanger!"; # print works
print "Number of GB left=" + $C;
Using a Parse Tree Inspector I can see assignments are working fine but when it gets to the identification of the variable in the string it seems there is a mismatched input '+' expecting STMTEND.
I wondered if anyone could help me out here and see what adjustment I need to make to my rules and grammar to solve this issue.
Many thanks in advance.
Kevin
PS. As a side issue I would rather have C$ than $C but early days...
********RULES************
VARNAME : '$'('A'..'Z')*
;
CONCAT : '+'
;
STMTEND : SEMICOLON NEWLINE* | NEWLINE+
;
STRING : SQUOTED_STRING (CONCAT SQUOTED_STRING | CONCAT VARNAME)*
| DQUOTED_STRING (CONCAT DQUOTED_STRING | CONCAT VARNAME)*
;
fragment SQUOTED_STRING : '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+
;
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
SEMICOLON : ';'
;
NEWLINE : '\r'?'\n'
************GRAMMAR************
print_command
: PRINT STRING STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS INTEGER STMTEND #assignInteger
| VARNAME EQUALS STRING STMTEND #assignString
;
You shouldn't try to create concat-expressions inside your lexer: that is the responsibility of the parser. Something like this should do it:
print_command
: PRINT expression STMTEND #printCommandLabel
;
assignment
: VARNAME EQUALS expression STMTEND
;
expression
: expression CONCAT expression
| INTEGER
| STRING
| VARNAME
;
CONCAT
: '+'
;
VARNAME
: '$'('A'..'Z')*
;
STMTEND
: SEMICOLON NEWLINE*
| NEWLINE+
;
STRING
: SQUOTED_STRING
| DQUOTED_STRING
;
fragment SQUOTED_STRING
: '\'' (~['])* '\''
;
fragment DQUOTED_STRING
: '"' ( ESC_SEQ| ~('\\'|'"') )* '"'
;
fragment ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment HEX_DIGIT : '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
fragment UNICODE_ESC : '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
fragment SEMICOLON : ';';
fragment NEWLINE : '\r'?'\n';
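The division of labour suggested here (the lexer emits flat STRING, VARNAME and CONCAT tokens, and concatenation is handled one level up) can be sketched in plain Python. This is an illustration under simplifying assumptions: only double-quoted strings without escapes are handled, and `tokenize`, `evaluate` and the `variables` dictionary are hypothetical names, not generated ANTLR code.

```python
import re

# One flat token per match; no token ever contains a '+'.
TOKEN = re.compile(
    r'\s*(?:(?P<STRING>"[^"]*")'
    r'|(?P<VARNAME>\$[A-Z]*)'
    r'|(?P<CONCAT>\+)'
    r'|(?P<INTEGER>[0-9]+))'
)

def tokenize(text):
    """Split the input into (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def evaluate(tokens, variables):
    """expression : operand (CONCAT operand)* ; evaluated left to right."""
    def operand(token):
        kind, value = token
        if kind == "STRING":
            return value[1:-1]          # strip the surrounding quotes
        if kind == "VARNAME":
            return str(variables[value])
        if kind == "INTEGER":
            return value
        raise SyntaxError(f"operand expected, got {kind}")

    if not tokens:
        raise SyntaxError("empty expression")
    result = operand(tokens[0])
    i = 1
    while i < len(tokens):
        if tokens[i][0] != "CONCAT" or i + 1 >= len(tokens):
            raise SyntaxError("expected '+' followed by an operand")
        result += operand(tokens[i + 1])
        i += 2
    return result

tokens = tokenize('"Number of GB left=" + $C')
print(tokens)
print(evaluate(tokens, {"$C": 15}))  # Number of GB left=15
```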
