I have a language where i want to parse unicode characters. Those characters are presided by %.
So this text: %,, this: a=&, or even this: (a,b)=%, should detect the ',' as unicode character.
It does so until i add the pattern for (a,b).
Here the code that works without (a,b):
grammar example;
test: expr | decl;
decl: (VARIABLE_DECLARATION? ID ) '=' expr
;
VARIABLE_DECLARATION
: 'public' | 'private'
;
expr: unicode;
unicode: '%' CHAR;
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
CHAR: // Other_Punctuation
'\u{0021}'..'\u{0023}' // [!..#] Basic Latin
| '\u{0025}'..'\u{0027}' // [%..'] Basic Latin
| '\u{002a}' // [*] Basic Latin
| '\u{002c}' // [,] Basic Latin
| '\u{002e}'..'\u{002f}' // [.../] Basic Latin
| '\u{003a}'..'\u{003b}' // [:..;] Basic Latin
| '\u{003f}'..'\u{0040}' // [?..#] Basic Latin
| '\u{005c}' // [\] Basic Latin
;
with this i get the following error: mismatched input ',' expecting CHAR
grammar example;
test: expr | decl;
decl: (VARIABLE_DECLARATION? ID | '('ID (',' ID)* ')' ) '=' expr
;
VARIABLE_DECLARATION
: 'public' | 'private'
;
expr: unicode;
unicode: '%' CHAR;
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
CHAR: // Other_Punctuation
'\u{0021}'..'\u{0023}' // [!..#] Basic Latin
| '\u{0025}'..'\u{0027}' // [%..'] Basic Latin
| '\u{002a}' // [*] Basic Latin
| '\u{002c}' // [,] Basic Latin
| '\u{002e}'..'\u{002f}' // [.../] Basic Latin
| '\u{003a}'..'\u{003b}' // [:..;] Basic Latin
| '\u{003f}'..'\u{0040}' // [?..#] Basic Latin
| '\u{005c}' // [\] Basic Latin
;
how can i solve that?
'\u{002c}' does indeed match , (though I don't understand why you'd write it as a Unicode escape instead of just ','). The problem is that you're also using ',' as a literal in your parser rules. This implicitly defines a lexer rule that matches only ,.
Lexer rules that are implicitly defines through the use of literals have higher priority than named lexer rules, so whenever the lexer sees a comma, it chooses to create a ',' token instead of a CHAR token.
To fix this I suggest you remove , from the set of characters matched by CHAR and instead use (CHAR | ',') wherever you want to allow both. You could even define a non-terminal char: CHAR | ','; and use that.
Related
I am a newbie to ANTLR4 and language compilers. I am working on building a language compiler using ANTLR4 Java. I have a small problem with parsing strings. The reserved words/ Tokens are getting matched instead of string. For eg: IF is a keyword token in my lexer but how to use "if" as a string?
Lexer file:
lexer grammar testgrammar;
IF : I F;
ENDIF : E N D I F;
ELSE : E L S E;
CASE : C A S E;
ENDCASE : E N D C A S E;
BREAK : B R E A K;
SWITCH : S W I T C H;
SUBSTRING : S U B S T R I N G;
COMMA : ',' ;
SEMI : ';' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
DOT : '.' ;// ('.' {$setType(DOTDOT);})? ;
LCURLY : '{' ;
RCURLY : '}' ;
AND : '&&' ;
OR : '||' ;
DOUBLEQUOTES : '"' ;
COMPARATOR : '=='| '>=' | '>' | '<' | '<=' | '!=' ;
SYMBOLS : '§' | '$' | '%' | '/' | '=' | '?' | '#' | '_' | '#' | '€';
LETTER : [A-Za-z\u00e4\u00c4\u00d6\u00f6\u00dc\u00fc\u00df];
NUMERICVALUE : NUMBER ('.' NUMBER)?;
STRING_LITERAL : '\'' ('\'\'' | ~('\''))* '\'';
NOTCONDITION : NOT;
OPERATORS : OPERATOR;
COMMENT : (('/*' .*? '*/') | ('//' ~[\r\n]*)) -> skip;
WS : (' ' | '\t' | '\r' | '\n')+ -> skip;
fragment A:('a'|'A');
fragment B:('b'|'B');
fragment C:('c'|'C');
fragment D:('d'|'D');
fragment E:('e'|'E');
fragment F:('f'|'F');
fragment G:('g'|'G');
fragment H:('h'|'H');
fragment I:('i'|'I');
fragment J:('j'|'J');
fragment K:('k'|'K');
fragment L:('l'|'L');
fragment M:('m'|'M');
fragment N:('n'|'N');
fragment O:('o'|'O');
fragment P:('p'|'P');
fragment Q:('q'|'Q');
fragment R:('r'|'R');
fragment S:('s'|'S');
fragment T:('t'|'T');
fragment U:('u'|'U');
fragment V:('v'|'V');
fragment W:('w'|'W');
fragment X:('x'|'X');
fragment Y:('y'|'Y');
fragment Z:('z'|'Z');
fragment NUMBER:[0-9]+;
fragment OPERATOR: ('+'|'-'|'&'|'*'|'~');
fragment NOT: ('!');
grammar:
parser grammar testParser;
symbolCharacters: (SYMBOLS | operators) ;
word:
( symbolCharacters | LETTER )+
;
wordList:
word+
;
I am not supposed share full grammar. But i have shared enough information i guess. I can understand that the words are formed from LETTERS and Symbol characters. One workaround i can do is making word rule like:
word:
( symbolCharacters | LETTER | IF | SWITCH | CASE | ELSE | BREAK )+
;
I have a lot of tokens. I dont want to add everything individually. Is there any other nice way to accomplish this?
Valid expression
Error expression
How to make the parser ignore the keywords inside the string?
Your same grammar does not have the problem you describe:
➜ antlr4 testgrammar.g4
➜ javac *.java
➜ echo "if 'if' endif" | grun testgrammar tokens -tokens
[#0,0:1='if',<IF>,1:0]
[#1,3:6=''if'',<STRING_LITERAL>,1:3]
[#2,8:12='endif',<ENDIF>,1:8]
[#3,14:13='<EOF>',<EOF>,2:0]
(perhaps you have inadvertently "corrected" the problem as you trimmed your grammar down, so I'll elaborate a bit.)
In short, during the lexing/tokenization phase of ANTLR parsing your input, ANTLR will, naturally, attempt to match you Lexer rules. If ANTLR finds a match of multiple rules for the current characters of your input stream, it follows two rules to determine a "winner".
If a rule matches a longer sequence of input characters, then that rule will be used.
If two rules match the same number of input characters, then the rule appearing first in your grammar will be used.
In your case, neither really comes into play as the grammar, when it reaches the ', will attempt to complete the STRING_LITERAL rule, and will find a match for the characters 'if'. It will never even attempt to match you IF lexer rule.
BTW, I did have to correct the symbolCharacters parser rule to be
symbolCharacters: (SYMBOLS | OPERATORS);
I have a query grammar I am working on and have found one case that is proving difficult to solve. The below provides a minimal version of the grammar to reproduce it.
grammar scratch;
query : command* ; // input rule
RANGE: '..';
NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+));
STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' | '.' )+ ;
WS: [ \t\r\n]+ -> skip ;
command
: 'foo:' number_range # FooCommand
| 'bar:' item_list # BarCommand
;
number_range: NUMBER RANGE NUMBER # NumberRange;
item_list: '(' (NUMBER | STRING)+ ((',' | '|') (NUMBER | STRING)+)* ')' # ItemList;
When using this you can match things like bar:(bob, blah, 57, 4.5) foo:2..4.3 no problem. But if you put in bar:(bob.smith, blah, 57, 4.5) foo:2..4 it will complain line 1:8 token recognition error at: '.s' and split it into 'bob' and 'mith'. Makes sense, . is ignored as part of string. Although not sure why it eats the 's'.
So, change string to STRING: ~([ \t\r\n] | '(' | ')' | ':' | '|' | ',' )+ ; instead without the dot in it. And now it will recognize 2..4.3 as a string instead of number_range.
I believe that this is because the string matches more character in one stretch than other options. But is there a way to force STRING to only match if it hasn't already matched elements higher in the grammar? Meaning it is only a STRING if it does not contain RANGE or NUMBER?
I know I can add TERM: '"' .*? '"'; and then add TERM into the item_list, but I was hoping to avoid having to quote things if possible. But seems to be the only route to keep the .. range in, that I have found.
You could allow only single dots inside strings like this:
STRING : ATOM+ ( '.' ATOM+ )*;
fragment ATOM : ~[ \t\r\n():|,.];
Oh, and NUMBER: ([0-9]+ | (([0-9]+)? '.' [0-9]+)); is rather verbose. This does the same: NUMBER : ( [0-9]* '.' )? [0-9]+;
I want to match the following text:
test.define_shared_constant(:testConst, "12", false)
With this grammar it matches correctly:
grammar test;
statement: shared_constant_defioniton | method_call;
KEY: ':' ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'?'|'!'|'|'|'-'|'()')+;
expr: STRING;
STRING: '"' (~'"')* ('"' | NEWLINE) | '\'' (~'\'')* ('\'' | NEWLINE);
NEWLINE: '\r'? '\n' | '\r';
BOOLEAN: 'true' | 'false';
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
shared_constant_defioniton
: ID('.define_shared_constant' '(' KEY ',' expr ',' (BOOLEAN) ')')
;
method_call
: ID '.' ID? '('expr*(',' expr)*')'
;
With this grammar it does not match. It matches to method_call which is not even correct.
shared_constant_defioniton
: ID('.' DEF_SHARED_CONSTANT '(' KEY ',' expr ',' (BOOLEAN) ')')
;
It is interpreting 'define_shared_constant' as ID. So I have to specify that ID should not contain 'define_'. But how can I do that?
ID: ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
WS : [ \t\n\r]+ -> channel(HIDDEN);
DEF_SHARED_CONSTANT: 'define_shared_constant';
Here both ID and DEF_SHARED_CONSTANT could match the input define_shared_constant. In cases like this where multiple rules could match and would produce a match of the same length, the rule that's defined first wins. So defined_shared_constant is recognized as an ID token because ID is defined first.
To get the behaviour you want, you should move the definition of DEF_SHARED_CONSTANT before the definition of ID. If you don't define a named lexer rule for it at all and instead use 'define_shared_constant' directly in the parser rule, that also works because implicitly defined lexer rules act as if they had been defined at the beginning of the file.
This worked according to ANTLR specification. But running it as an IntelliJ Language plugin did not. I had use a predicated and the final solution looks like this:
ID: { getText().indexOf("define") == 0}? ('a'..'z'|'A'..'Z'|'!') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'!'|'?')*;
I've been playing around with modes in an attempt to parse a message like this:
-MSGTXT (DO NOT TOKENIZE (THERE CAN BE PARENS HERE) THIS PART)
-END END OF MESSAGE
-TEST 123
The contents of MSGTXT can be any character so I set up my lexer grammar as follows:
lexer grammar ADEXPLexer;
// Fields
MSGTYP: 'MSGTYP';
ADEP: 'ADEP';
TITLE: 'TITLE';
FILTIM: 'FILTIM';
ORIGINDT: 'ORIGINDT';
IFPLID: 'IFPLID';
MSGTXT: 'MSGTXT' -> pushMode(MSG);
COMMENT: 'COMMENT';
// Message types.
ACK: 'ACK';
IFPL: 'IFPL';
// Lexical rules.
SEP: HYPHEN;
WS: [ \t\n\r] + -> skip;
KEYWORD: (ALPHA|DIGIT)+;
mode MSG;
TEXT: CLOSE_MSG | (ALPHA|DIGIT|SPECIAL|WS|HYPHEN)+;
CLOSE_MSG: ')' -> popMode;
fragment HYPHEN: '-';
fragment ALPHA: [A-Z];
fragment DIGIT: [0-9];
fragment SPECIAL
: '('
| '?'
| ':'
| '.'
| ','
| '\''
| '='
| '+'
| '/'
| ')'
;
The problem now however is that the last closing ')' is never used to break out back into the default mode so it continues on into other parts of the message. The parser rule itself looks like this:
msgtxt: SEP MSGTXT TEXT;
I'm looking for a way to get around this which doesn't involve TokenStreamRewriter as there's no such thing in the JavaScript runtime.
Any help appreciated!
Not sure what you need exactly, but if you don't need to check if contents of the TEXT is one of the (ALPHA|DIGIT|SPECIAL|WS|HYPHEN) just use this:
mode MSG;
TEXT: ~[)]+;
CLOSE_MSG: ')' -> popMode;
if you do, just exclude ')' from fragment SPECIAL
I'm working on parsing PDF streams. In section 7.3.4.2 on literal string objects, the PDF Reference says that a backslash within a literal string that isn't followed by an end-of-line character, one to three octal digits, or one of the characters "nrtbf()\" should be ignored. Is there a way to get the recover method in my lexer to ignore a backslash in this situation?
Here is my simplified parser:
parser grammar PdfStreamParser;
options { tokenVocab=PdfSteamLexer; }
array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
: NULL
| array
| dictionary
| BOOLEAN
| NUMBER
| string
| NAME
;
content : stat* ;
stat
: tj
;
tj: ((string Tj) | (array TJ)) ; // Show text
Here's the lexer. (Based on the advice in this answer I'm not using a separate string mode):
lexer grammar PdfStreamLexer;
Tj: 'Tj' ;
TJ: 'TJ' ;
NULL: 'null' ;
BOOLEAN: ('true'|'false') ;
LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;
NUMBER: ('+' | '-')? (INT | FLOAT) ;
NAME: '/' ID ;
// A sequence of literal characters enclosed in parentheses.
LITERAL_STRING: '(' ( ~[()\\]+ | ESCAPE_SEQUENCE | LITERAL_STRING )* ')' ;
// Escape sequences that can occur within a LITERAL_STRING
fragment ESCAPE_SEQUENCE
: '\\' ( [\r\nnrtbf()\\] | [0-7] [0-7]? [0-7]? )
;
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; // Hexadecimal data enclosed in angle brackets
fragment INT: DIGIT+ ; // match 1 or more digits
fragment FLOAT: DIGIT+ '.' DIGIT* // match 1. 39. 3.14159 etc...
| '.' DIGIT+ // match .1 .14159
;
fragment DIGIT: [0-9] ; // match single digit
// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;
WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters
I can override the recover method in the PdfStreamLexer class and get notified when the LexerNoViableAltException occurs, but I'm not sure how to (or if it's possible to) ignore the backslash and continue on with the LITERAL_STRING tokenization.
To be able to skip part of the string, you'll need to use lexical modes. Here's a quick demo:
lexer grammar DemoLexer;
STRING_OPEN
: '(' -> pushMode(STRING_MODE)
;
SPACES
: [ \t\r\n] -> skip
;
OTHER
: .
;
mode STRING_MODE;
STRING_CLOSE
: ')' -> popMode
;
ESCAPE
: '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
;
STRING_PART
: ~[\\()]
;
NESTED_STRING_OPEN
: '(' -> type(STRING_OPEN), pushMode(STRING_MODE)
;
IGNORED_ESCAPE
: '\\' . -> skip
;
which can be used in the parser as follows:
parser grammar DemoParser;
options {
tokenVocab=DemoLexer;
}
parse
: ( string | OTHER )* EOF
;
string
: STRING_OPEN ( ESCAPE | STRING_PART | string )* STRING_CLOSE
;
If you now parse the string FU(abc(def)\#\))BAR, you will get the following parse tree:
As you can see, the \) is left in the tree, but \# is omitted.