antlr4 gives token recognition error with lexer island grammar
I need antlr4 to parse some simple HTML files. I have split my grammar into a parser grammar and a lexer grammar so that I can use an island grammar for the content inside tags (inside < and >), as described in "The Definitive ANTLR 4 Reference". antlr4 repeatedly reports "token recognition error".
parser grammar:
grammar Rule;
options {
tokenVocab = HTMLLexer;
language = Java;
}
/* Parser Rules */
doc : type? html ;
type : '<!DOCTYPE HTML>' ;
html : shtml head body ehtml ;
head : shead meta* ehead ;
meta : smeta ;
body : sbody ebody ;
shtml : '<' 'html' attr* '>' ;
ehtml : '<' '/html' '>' ;
shead : '<' 'head' attr* '>' ;
ehead : '<' '/head' '>' ;
smeta : '<' 'meta' attr+ '>' ;
sbody : '<' 'body' attr* '>' ;
ebody : '<' '/body' '>' ;
attr : NAME '=' VALUE ;
lexer grammar:
lexer grammar HTMLLexer;
COMMENT : '<!--' .*? '-->' -> skip ;
CDATA : '<![CDATA[' .*? ']]>' ;
OPEN : '<' -> pushMode(INSIDE) ;
SPEC_OPEN : '<!' -> pushMode(INSIDE) ;
TEXT : (ENTITY | ~[<&])+ ;
fragment ENTITY
: '&' [a-zA-Z]+ ';'
| '&#' [0-9]+ ';'
| '&#x' [0-9A-Za-z]+ ';' ;
mode INSIDE;
CLOSE : '>' -> popMode ;
SLASH_CLOSE : '/>' -> popMode ;
StHTML : 'html' ;
EnHTML : '/html' ;
StHead : 'head' ;
EnHead : '/head' ;
StMeta : 'meta' ;
StBody : 'body' ;
EnBody : '/body' ;
NAME : 'class'
| 'content'
| 'http-equiv'
| 'id'
| 'lang'
| 'name'
| 'style'
| 'type'
;
EQUALS : '=' ;
VALUE : ('"' ~["<>\r\n]+ '"')
| ('\'' ~['<>\r\n]+ '\'')
| ~["'<>= \t\r\n]+ ;
WS : [ \t\r\n]+ -> skip ;
sample HTML file:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
</head>
<body lang=EN-US style='text-justify-trim:punctuation'>
</body>
</html>
output from antlr4:
line 1:6 token recognition error at: '\n'
line 2:6 token recognition error at: '\n'
line 3:5 token recognition error at: ' '
line 3:6 token recognition error at: 'htt'
line 3:9 token recognition error at: 'p'
...
[#0,0:0='<',<7>,1:0]
[#1,1:4='html',<10>,1:1]
[#2,5:5='>',<1>,1:5]
[#3,7:7='<',<7>,2:0]
[#4,8:11='head',<6>,2:1]
[#5,12:12='>',<1>,2:5]
[#6,14:14='<',<7>,3:0]
[#7,15:18='meta',<2>,3:1]
[#8,30:30='=',<9>,3:16]
[#9,51:51='=',<9>,3:37]
[#10,57:61='/html',<4>,3:43]
[#11,71:71='=',<9>,3:57]
[#12,85:85='>',<1>,3:71]
[#13,87:87='<',<7>,4:0]
[#14,88:91='meta',<2>,4:1]
[#15,115:115='=',<9>,4:28]
[#16,146:146='>',<1>,4:59]
[#17,148:148='<',<7>,5:0]
[#18,149:153='/head',<8>,5:1]
[#19,154:154='>',<1>,5:6]
[#20,157:157='<',<7>,7:0]
[#21,158:161='body',<5>,7:1]
[#22,167:167='=',<9>,7:10]
[#23,179:179='=',<9>,7:22]
[#24,211:211='>',<1>,7:54]
[#25,213:213='<',<7>,8:0]
[#26,214:218='/body',<11>,8:1]
[#27,219:219='>',<1>,8:6]
[#28,221:221='<',<7>,9:0]
[#29,222:226='/html',<4>,9:1]
[#30,227:227='>',<1>,9:6]
[#31,229:228='<EOF>',<-1>,10:0]
line 3:16 mismatched input '=' expecting NAME
line 4:28 mismatched input '=' expecting NAME
line 7:10 mismatched input '=' expecting {'>', NAME}
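First of all, you need to change the declaration of your parser from grammar Rule; to parser grammar Rule;. A combined grammar (grammar Rule;) makes ANTLR generate a lexer of its own from the literals used in the parser rules, and that lexer has no modes and no WS rule, which would be consistent with the errors at '\n' and ' ' and with '/html' being picked out of "text/html". I don't see any problems with your lexer that would produce those particular error messages, so that could be the problem.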
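To make the island idea from the question concrete, here is a rough Python model of a two-mode lexer. It is only a sketch and not ANTLR-generated code: the names MODES and tokenize are made up for illustration, the rule set is a trimmed-down copy of the lexer above, and the selection policy (longest match, earlier rule on ties) is written out by hand.

```python
import re

# Hypothetical, trimmed-down stand-in for the two modes of the lexer in the question.
# Each rule is (token name, regex, action); rule order only matters to break ties.
MODES = {
    "DEFAULT": [
        ("OPEN", r"<", "push"),
        ("TEXT", r"[^<&]+", None),
    ],
    "INSIDE": [
        ("CLOSE", r">", "pop"),
        ("StMeta", r"meta", None),
        ("NAME", r"content|http-equiv|name", None),
        ("EQUALS", r"=", None),
        ("VALUE", r'"[^"<>\r\n]+"|[^"\'<>= \t\r\n]+', None),
        ("WS", r"[ \t\r\n]+", "skip"),
    ],
}


def tokenize(text):
    """Tokenize with a mode stack: '<' enters INSIDE, '>' leaves it."""
    stack, pos, out = ["DEFAULT"], 0, []
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two rules match the same length.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end(), action)
        if best is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        name, end, action = best
        if action != "skip":
            out.append((name, text[pos:end]))
        if action == "push":
            stack.append("INSIDE")
        elif action == "pop":
            stack.pop()
        pos = end
    return out


for token in tokenize('<meta content="text/html">'):
    print(token)
```

In this model the quoted "text/html" comes out as one VALUE token and the space inside the tag is skipped, which is the behaviour the separate lexer grammar is meant to provide once the parser actually uses it.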
Related
Problem matching single digits when integers are defined as tokens
I'm having a problem getting a grammar to work. Here is the simplified version. The language I am trying to parse has expressions like these:
testing1(2342);
testing2(idfor2);
testing3(4654);
testing4[1..n];
testing5[0..1];
testing6(7);
testing7(1);
testing8(o);
testing9(n);
The problem arises when I introduce the rules for the [1..n] or [0..1] expressions. The grammar file (one of the many variations I've tried):
grammar test;
tests : test* ;
test : call | declaration ;
call : callName '(' callParameter ')' ';' ;
callName : Identifier ;
callParameter : Identifier | Integer ;
declaration : declarationName '[' declarationParams ']' ';' ;
declarationName : Identifier ;
declarationParams : decMin '..' decMax ;
decMin : '0' | '1' ;
decMax : '1' | 'n' ;
Integer : [0-9]+ ;
Identifier : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
When I parse the sample with this grammar, it fails on testing7(1); and testing9(n);. The argument is matched as decMin or decMax instead of Integer or Identifier:
line 8:9 mismatched input '1' expecting {Integer, Identifier}
line 10:9 mismatched input 'n' expecting {Integer, Identifier}
I've tried many variations but I can't make it work.
I think your problem comes from not clearly defining the lexer rules you want. When you added this rule:
decMin : '0' | '1' ;
you in fact created an unnamed lexer rule that matches '0' and another one that matches '1':
UNNAMED_0_RULE : '0';
UNNAMED_1_RULE : '1';
and your parser rule became:
decMin : UNNAMED_0_RULE | UNNAMED_1_RULE ;
The problem: now, when the input contains testing7(1); the parser doesn't see
callName '(' callParameter ')' ';'
anymore, it sees
callName '(' UNNAMED_1_RULE ')' ';'
and it doesn't understand that. That is because lexer rules are applied before the parser rules. To solve your problem, define your lexer rules explicitly. It would probably look like this:
grammar test;
/*---------------- PARSER ----------------*/
tests : test* ;
test : call | declaration ;
call : callName '(' callParameter ')' ';' ;
callName : identifier ;
callParameter : identifier | integer ;
declaration : declarationName '[' declarationParams ']' ';' ;
declarationName : identifier ;
declarationParams : decMin '..' decMax ;
decMin : INTEGER_ZERO | INTEGER_ONE ;
decMax : INTEGER_ONE | LETTER_N ;
integer : (INTEGER_ZERO | INTEGER_ONE | INTEGER_OTHERS)+ ;
identifier : LETTER_N | IDENTIFIER ;
/*---------------- LEXER ----------------*/
LETTER_N: N;
IDENTIFIER : [a-zA-Z_][a-zA-Z0-9_]* ;
WS : [ \t\r\n]+ -> skip ;
INTEGER_ZERO: '0';
INTEGER_ONE: '1';
INTEGER_OTHERS: '2'..'9';
fragment N: [nN];
I just tested this grammar and it works. The drawback is that it cuts your integers up at the lexer step (splitting 1245 into the tokens 1 2 4 5 and then relying on the parser rule to put 1, 2, 4 and 5 back together). I think it would be better to be less precise and simply write:
decMin: integer | identifier;
But then it depends on what you do with your grammar...
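The effect described above can be reproduced with a small Python model of how a lexer picks a token. This is a sketch under stated assumptions, not ANTLR itself: tokenize and IMPLICIT_FIRST are made-up names, the implicit literal rules are assumed to sit ahead of the named rules, and all punctuation is lumped into a single hypothetical PUNCT rule for brevity.

```python
import re


def tokenize(rules, text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# Literals used in parser rules become unnamed token rules listed before the named ones.
IMPLICIT_FIRST = [
    ("'0'", r"0"),
    ("'1'", r"1"),
    ("'n'", r"n"),
    ("Integer", r"[0-9]+"),
    ("Identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("PUNCT", r"\.\.|[()\[\];]"),  # hypothetical catch-all for the remaining literals
    ("WS", r"[ \t\r\n]+"),
]

print(tokenize(IMPLICIT_FIRST, "testing7(1);")[2])     # ("'1'", '1')
print(tokenize(IMPLICIT_FIRST, "testing1(2342);")[2])  # ('Integer', '2342')
```

A lone 1 ties between the literal rule and Integer, so the literal wins, while 2342 is longer than any literal and still becomes an Integer. That matches the two error lines in the question.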
Not Able to Recognize Strings and Characters in ANTLR
In my ANTLR code, we should be able to recognize strings, characters, hexadecimal numbers etc. However, when I test it like this:
grun A1_lexer tokens -tokens test.txt
with my test.txt file being a simple string, such as "pineapple", it is unable to recognize the different tokens. In my lexer, I define the following helper tokens:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9'] ;
fragment Digit: ['0'-'9'] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
And I define the following tokens:
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Id: Alpha Alpha_num* ;
I run it like this:
grun A1_lexer tokens -tokens test.txt
And it outputs this:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really wondering what the problem is and how I could fix it. Thanks.
UPDATE 1:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '\"' ;
I have updated the code and got rid of the unnecessary single quotes in my Char classification. However, I get the same output as before.
UPDATE 2: Even when I make the changes suggested, I still get the same error. I believed the problem was that I was not recompiling, but I am. These are the steps that I take to recompile:
antlr4 A1_lexer.g4
javac A1_lexer*.java
chmod a+x build.sh
./build.sh
grun A1_lexer tokens -tokens test.txt
With my build.sh file looking like this:
#!/bin/bash
FILE="A1_lexer"
ANTLR=$(echo $CLASSPATH | tr ':' '\n' | grep -m 1 "antlr-4.7.1-complete.jar")
java -jar $ANTLR $FILE.g4
javac $FILE*.java
Even when I recompile, my ANTLR code is still unable to recognize the tokens. My code is also now like this:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
UPDATE 3: Grammar:
program :'class Program {'field_decl* method_decl*'}'
field_decl : type (id | id'['int_literal']') ( ',' id | id'['int_literal']')*';' | type id '=' literal ';'
method_decl : (type | 'void') id'('( (type id) ( ','type id)*)? ')'block
block : '{'var_decl* statement*'}'
var_decl : type id(','id)* ';'
type : 'int' | 'boolean'
statement : location assign_op expr';' | method_call';' | 'if ('expr')' block ('else' block )? | 'switch' expr '{'('case' literal ':' statement*)+'}' | 'while (' expr ')' statement | 'return' ( expr )? ';' | 'break ;' | 'continue ;' | block
assign_op : '=' | '+=' | '-='
method_call : method_name '(' (expr ( ',' expr )*)? ')' | 'callout (' string_literal ( ',' callout_arg )* ')'
method_name : id
location : id | id '[' expr ']'
expr : location | method_call | literal | expr bin_op expr | '-' expr | '!' expr | '(' expr ')'
callout_arg : expr | string_literal
bin_op : arith_op | rel_op | eq_op | cond_op
arith_op : '+' | '-' | '*' | '/' | '%'
rel_op : '<' | '>' | '<=' | '>='
eq_op : '==' | '!='
cond_op : '&&' | '||'
literal : int_literal | char_literal | bool_literal
id : alpha alpha_num*
alpha : ['a'-'z''A'-'Z''_']
alpha_num : alpha | digit
digit : ['0'-'9']
hex_digit : digit | ['a'-'f''A'-'F']
int_literal : decimal_literal | hex_literal
decimal_literal : digit+
hex_literal : '0x' hex_digit+
bool_literal : 'true' | 'false'
char_literal : '‘'char'’'
string_literal : '“'char*'”'
test.txt:
"pineapple"
A1_lexer:
fragment Delimiter: ' ' | '\t' | '\n' ;
fragment Alpha: [a-zA-Z_];
fragment Char: [a-zA-Z0-9] ;
fragment Digit: [0-9] ;
fragment Alpha_num: Alpha | Digit ;
fragment Single_quote: '\'' ;
fragment Double_quote: '"' ;
fragment Hex_digit: Digit | [a-fA-F] ;
fragment Eq_op: '==' | '!=' ;
Char_literal : (Single_quote)Char(Single_quote) ;
String_literal : (Double_quote)Char*(Double_quote) ;
Decimal_literal : Digit+ ;
Id: Alpha Alpha_num* ;
What I Write in Terminal:
grun A1_lexer tokens -tokens test.txt
Output in Terminal:
line 1:0 token recognition error at: '"'
line 1:1 token recognition error at: 'p'
line 1:2 token recognition error at: 'ine'
line 1:6 token recognition error at: 'p'
line 1:7 token recognition error at: 'p'
line 1:8 token recognition error at: 'l'
line 1:9 token recognition error at: 'e"'
[#0,5:5='a',<Id>,1:5]
[#1,12:11='<EOF>',<EOF>,2:0]
I am really not sure why this is happening.
fragment Char: ['a'-'z'] | ['A' - 'Z'] | ['0' - '9']
['a'-'z'] doesn't mean "a to z", it means "a single quote, or a, or a single quote to a single quote, or z, or a single quote", which simplifies to just "a single quote, a or z". What you want is just [a-z] without the quotes. The same applies to the other character classes as well, except that they also contain spaces, so it's "single quote, A, single quote, space to space, single quote, Z, or single quote" etc.
Also, you don't need to "or" character classes; you can just write everything in one character class like this: [a-zA-Z0-9] (like you already did for the Alpha rule). The same applies to the Digit rule as well.
Note that it's a bit unusual to only allow these specific characters inside quotes. Usually you'd allow everything that isn't an unescaped quote or an invalid escape sequence. But of course that all depends on the language you're parsing.
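The reading of the quoted character class given above can be checked with Python's re module, which happens to treat quotes, spaces and ranges inside [...] the same way for these particular classes. This is an analogy rather than a test of ANTLR itself, and the variable names are made up for illustration.

```python
import re

quoted = re.compile(r"['a'-'z']")    # the class as written in the question
spaced = re.compile(r"['A' - 'Z']")  # the variant with spaces around the dash
plain = re.compile(r"[a-z]")         # what was intended

# Only the quote, 'a' and 'z' are members of the quoted class; 'p' is not.
print([c for c in "az'p" if quoted.fullmatch(c)])    # ['a', 'z', "'"]

# With spaces, the range is "space to space", so a space is a member but '-' and 'M' are not.
print([c for c in "AZ' -M" if spaced.fullmatch(c)])  # ['A', 'Z', "'", ' ']

# The unquoted class accepts every lowercase letter of the sample input.
print(all(plain.fullmatch(c) for c in "pineapple"))  # True
```

This lines up with the output in the question, where the only character of "pineapple" that the lexer accepted was the a.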
ANTLR4 Grammar picks up 'and' and 'or' in variable names
Please help me with my ANTLR4 Grammar. Sample "formel":
(Arbejde.ArbejderIKommuneNr=860) and (Arbejde.ErIArbejde = 'J') & (Arbejde.ArbejdsTimerPrUge = 40)
(Ansogeren.BorIKommunen = 'J') and (BeregnDato(Ansogeren.Fodselsdato; '+62Å') < DagsDato)
(Arb.BorI=860)
My problem is that Arb.BorI=860 is not handled correctly. I get this error:
Error: no viable alternative at input '(Arb.Bor' at linenr/position: 1/6 \r\nException: Der blev udløst en undtagelse af typen 'Antlr4.Runtime.NoViableAltException
Please notice that Arb.BorI contains the word 'or'. I think my problem is that 'booleanOps' in the grammar overrides 'datakildefelt'. So my question is how to get my grammar right. I am stuck, so any help will be appreciated. My Grammar:
grammar UnikFormel;
formel : boolExpression # BooleanExpr | expression # Expr | '(' formel ')' # Parentes;
boolExpression : ( '(' expression ')' ) ( booleanOps '(' expression ')' )+;
expression : element compareOps element # Compare;
element : datakildefelt # DatakildeId | function # Funktion | int # Integer | decimal # Real | string # Text;
datakildefelt : datakilde '.' felt;
datakilde : identifyer;
felt : identifyer;
function : funktionsnavn ('(' funcParameters? ')')?;
funktionsnavn : identifyer;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
identifyer : LETTER+;
int : DIGIT+;
decimal : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
string : QUOTE .*? QUOTE;
booleanOps : (AND | OR);
compareOps : (LT | GT | EQ | GTEQ | LTEQ);
QUOTE : '\'';
OPERATOR: '+';
DIGIT: [0-9];
LETTER: [a-åA-Å];
MUL : '*';
DIV : '/';
ADD : '+';
SUB : '-';
GT : '>';
LT : '<';
EQ : '=';
GTEQ : '>=';
LTEQ : '<=';
AND : '&' | 'and';
OR : '?' | 'or';
WS : ' '+ -> skip;
Rules that come first have precedence when several rules match the same text. In your case you should keep AND and OR before LETTER (and before any identifier rule you add). The same goes for GTEQ and LTEQ, and maybe for other rules too.
EDIT
Additionally, you should make identifyer a lexer rule, i.e. start its name with a capital letter (IDENTIFIER or Identifier). The same goes for int, decimal and string. The input is initially a stream of characters and is first processed into a stream of tokens, using only lexer rules. At this point parser rules (those starting with a lowercase letter) do not come into play yet. So, to make "BorI" parse as a single entity (token), you need to create a lexer rule that matches identifiers. Currently it would be parsed as 3 tokens: LETTER (B) OR (or) LETTER (I).
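The three-token split of "BorI" can be reproduced with a small Python model of token selection. It is a sketch, not ANTLR: tokenize, LETTER_RULES and ID_RULES are made-up names, the policy (longest match first, earlier rule on ties) is written out by hand, and LETTER is simplified to ASCII letters.

```python
import re


def tokenize(rules, text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# One token per letter, as in the question: 'or' is longer than a single LETTER.
LETTER_RULES = [("LETTER", r"[a-zA-Z]"), ("AND", r"&|and"), ("OR", r"\?|or")]

# Identifiers as a single lexer rule, with the keywords listed ahead of it.
ID_RULES = [("AND", r"&|and"), ("OR", r"\?|or"), ("ID", r"[a-zA-Z][a-zA-Z0-9]*")]

print(tokenize(LETTER_RULES, "BorI"))  # [('LETTER', 'B'), ('OR', 'or'), ('LETTER', 'I')]
print(tokenize(ID_RULES, "BorI"))      # [('ID', 'BorI')]
print(tokenize(ID_RULES, "or"))        # [('OR', 'or')]
```

With an identifier rule, the whole word is the longest match, and a bare or still becomes OR because that rule is listed first.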
Thanks for your help. There were multiple problems. Reading the ANTLR4 book and using "TestRig -gui" got me on the right track. The working grammar is:
grammar UnikFormel;
formel : '(' formel ')' # Parentes | expression # Expr | boolExpression # BooleanExpr ;
boolExpression : '(' expression ')' ( booleanOps '(' expression ')' )+ | '(' formel ')' ( booleanOps '(' formel ')' )+;
expression : element compareOps element # Compare;
datakildefelt : ID '.' ID;
function : ID ('(' funcParameters? ')')?;
funcParameters : funcParameter (';' funcParameter)*;
funcParameter : element;
element : datakildefelt # DatakildeId | function # Funktion | INT # Integer | DECIMAL # Real | STRING # Text;
booleanOps : (AND | OR);
compareOps : ( GTEQ | LTEQ | LT | GT | EQ |);
AND : '&' | 'and';
OR : '?' | 'or';
GTEQ : '>=';
LTEQ : '<=';
GT : '>';
LT : '<';
EQ : '=';
ID : LETTER ( LETTER | DIGIT)*;
INT : DIGIT+;
DECIMAL : DIGIT+ '.' DIGIT+ | '.' DIGIT+;
STRING : QUOTE .*? QUOTE;
fragment QUOTE : '\'';
fragment DIGIT: [0-9];
fragment LETTER: [a-åA-Å];
WS : [ \t\r\n]+ -> skip;
ANTLR4: Correctly matching common prefixes
I have the following ANTLR4 grammar:
grammar test;
start_symbol: '(FILE' line* ')' EOF ;
line: '(' ID ')' ;
ID: [a-zA-Z_] [a-zA-Z0-9_]* ;
White_space : [ \t\n\r]+ -> skip ;
... and it works perfectly on this sample input file:
(FILE
(LINE)
)
But I also want it to work on:
(FILE
(FILELINE)
)
This does not work. Obviously, the lexer generated an implicit '(FILE' token, which will also match the '(FILELINE' in the second line, which leads to an error. How can I fix this?
Bonus: I also want to parse this:
(FILE
(FILE)
)
Thanks :)
Something like this would do it:
start : '(' FILE line* ')' EOF;
line : '(' id ')';
id : ID | FILE;
FILE : 'FILE';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;
SPACE : [ \t\n\r]+ -> skip;
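To see why splitting '(FILE' into '(' and FILE helps, the two token sets can be compared in a small Python model of token selection. It is only a sketch, not ANTLR: tokenize, ORIGINAL and FIXED are made-up names, and the longest-match policy with the earlier rule winning ties is written out by hand.

```python
import re


def tokenize(rules, text):
    """Pick the longest match at each position; on a tie the earlier rule wins."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m and len(m.group()) > best_len:  # strict '>' keeps the earlier rule on ties
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name != "SPACE":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


ID = r"[a-zA-Z_][a-zA-Z0-9_]*"
SPACE = ("SPACE", r"[ \t\n\r]+")

# Question: '(FILE' is one literal token, so it swallows the start of '(FILELINE'.
ORIGINAL = [("'(FILE'", r"\(FILE"), ("'('", r"\("), ("')'", r"\)"), ("ID", ID), SPACE]

# Answer: '(' and FILE are separate tokens, with FILE listed before ID.
FIXED = [("'('", r"\("), ("')'", r"\)"), ("FILE", r"FILE"), ("ID", ID), SPACE]

print(tokenize(ORIGINAL, "(FILELINE)"))  # [("'(FILE'", '(FILE'), ('ID', 'LINE'), ("')'", ')')]
print(tokenize(FIXED, "(FILELINE)"))     # [("'('", '('), ('ID', 'FILELINE'), ("')'", ')')]
print(tokenize(FIXED, "(FILE)"))         # [("'('", '('), ('FILE', 'FILE'), ("')'", ')')]
```

The last line also shows why the answer adds the id : ID | FILE rule: the word FILE on its own always comes out as a FILE token, so the parser has to accept it wherever an identifier is allowed.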
Error when generating a grammar for chess PGN files
I made this ANTLR4 grammar in order to parse a PGN file inside my Java program, but I can't manage to solve the ambiguity in it:
grammar Pgn;
file: game (NEWLINE+ game)*;
game: (tag+ NEWLINE+)? notation;
tag: [TAG_TYPE "TAG_VALUE"];
notation: move+ END_RESULT?;
move: MOVE_NUMBER\. MOVE_DESC MOVE_DESC #CompleteMove | MOVE_NUMBER\. MOVE_DESC #OnlyWhiteMove | MOVE_NUMBER\.\.\. MOVE_DESC #OnlyBlackMove ;
END_RESULT: '1-0' | '0-1' | '1/2-1/2' ;
TAG_TYPE: LETTER+;
TAG_VALUE: .*;
MOVE_NUMBER: DIGIT+;
MOVE_DESC: .*;
NEWLINE: \r? \n;
SPACES: [ \t]+ -> skip;
fragment LETTER: [a-zA-Z];
fragment DIGIT: [0-9];
And this is the error output:
$ antlr4 Pgn.g4
error(50): Pgn.g4:6:6: syntax error: 'TAG_TYPE "TAG_VALUE"' came as a complete surprise to me while matching alternative
I think the error comes from the fact that " [ ", " ] " and ' " ' can't be used freely, neither in parser rules nor in lexer rules. Any help or advice is welcome.
Looking at the specs for PGN, http://www.thechessdrum.net/PGN_Reference.txt, I see there's a formal definition of the PGN format there:
18: Formal syntax
<PGN-database> ::= <PGN-game> <PGN-database>
                   <empty>
<PGN-game> ::= <tag-section> <movetext-section>
<tag-section> ::= <tag-pair> <tag-section>
                  <empty>
<tag-pair> ::= [ <tag-name> <tag-value> ]
<tag-name> ::= <identifier>
<tag-value> ::= <string>
<movetext-section> ::= <element-sequence> <game-termination>
<element-sequence> ::= <element> <element-sequence>
                       <recursive-variation> <element-sequence>
                       <empty>
<element> ::= <move-number-indication>
              <SAN-move>
              <numeric-annotation-glyph>
<recursive-variation> ::= ( <element-sequence> )
<game-termination> ::= 1-0
                       0-1
                       1/2-1/2
                       *
<empty> ::=
I highly recommend that you make your ANTLR grammar resemble that as closely as possible. I made a small project with ANTLR 4 on Github which you can try out: https://github.com/bkiers/PGN-parser
The grammar (without comments):
parse : pgn_database EOF ;
pgn_database : pgn_game* ;
pgn_game : tag_section movetext_section ;
tag_section : tag_pair* ;
tag_pair : LEFT_BRACKET tag_name tag_value RIGHT_BRACKET ;
tag_name : SYMBOL ;
tag_value : STRING ;
movetext_section : element_sequence game_termination ;
element_sequence : (element | recursive_variation)* ;
element : move_number_indication | san_move | NUMERIC_ANNOTATION_GLYPH ;
move_number_indication : INTEGER PERIOD? ;
san_move : SYMBOL ;
recursive_variation : LEFT_PARENTHESIS element_sequence RIGHT_PARENTHESIS ;
game_termination : WHITE_WINS | BLACK_WINS | DRAWN_GAME | ASTERISK ;
WHITE_WINS : '1-0' ;
BLACK_WINS : '0-1' ;
DRAWN_GAME : '1/2-1/2' ;
REST_OF_LINE_COMMENT : ';' ~[\r\n]* -> skip ;
BRACE_COMMENT : '{' ~'}'* '}' -> skip ;
ESCAPE : {getCharPositionInLine() == 0}? '%' ~[\r\n]* -> skip ;
SPACES : [ \t\r\n]+ -> skip ;
STRING : '"' ('\\\\' | '\\"' | ~[\\"])* '"' ;
INTEGER : [0-9]+ ;
PERIOD : '.' ;
ASTERISK : '*' ;
LEFT_BRACKET : '[' ;
RIGHT_BRACKET : ']' ;
LEFT_PARENTHESIS : '(' ;
RIGHT_PARENTHESIS : ')' ;
LEFT_ANGLE_BRACKET : '<' ;
RIGHT_ANGLE_BRACKET : '>' ;
NUMERIC_ANNOTATION_GLYPH : '$' [0-9]+ ;
SYMBOL : [a-zA-Z0-9] [a-zA-Z0-9_+#=:-]* ;
SUFFIX_ANNOTATION : [?!] [?!]? ;
UNEXPECTED_CHAR : . ;
For a version with comments, see: https://github.com/bkiers/PGN-parser/blob/master/src/main/antlr4/nl/bigo/pp/PGN.g4